## Goal

Create a web application that collects news stories from specific sites, summarizes each article, assigns it to one of a set of predefined categories, and presents the summary and category to the user on screen.

Need to do:

1. **Web Scraping:**
   Use a Python library like Beautiful Soup and Requests or a specialized library like Scrapy to scrape news articles from specific websites. You'll need to identify the HTML structure of the articles on these sites and extract relevant information, such as the article content, title, and category.

2. **Text Summarization:**
   Implement text summarization using a library like Gensim, NLTK, or Hugging Face Transformers (for advanced models like BERT). The summarization process should take the full article text and produce a shorter summary.

3. **Text Classification:**
   For categorization, you'll need a machine learning model for text classification. You can train your own model using a dataset of categorized articles, or you can use pre-trained models like those from the Hugging Face Transformers library. The model will take the article text and categorize it into specific categories.

4. **Web Framework:**
   Choose a web framework for building your web application. Flask and Django are popular options in Python. Set up your application with routes for user interaction and displaying results.

5. **Database (Optional):**
   You can store scraped articles, summaries, and categories in a database for better management and retrieval.

6. **User Interface:**
   Design and develop the user interface to present the summarized articles and their categories to users. You can use HTML, CSS, and JavaScript for the front end. Libraries like Bootstrap can help with styling.

7. **Integration:**
   Integrate the web scraping, text summarization, and text classification components into your web application. When a user requests news, the application should scrape articles, summarize them, and classify them in real-time.

8. **User Interaction:**
   Implement user interaction features to allow users to request news from specific sites, view article summaries, and see categorized results.

9. **Deployment:**
   Deploy your web application on a server. You can use platforms like Heroku, AWS, or a VPS to host your application and make it accessible on the web.

10. **Testing and Maintenance:**
    Thoroughly test your web application to ensure it works as expected. Monitor for issues and maintain the application over time.

Based on this plan, this notebook should cover the web scraping, storing the scraped data in a database, and the model that generates text summaries. The Django part will be completed later.
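The database step is not implemented yet in this notebook. As a minimal sketch, assuming SQLite from the Python standard library is acceptable, the storage could look like the following. The `articles` table, its column names, and the helper functions `init_db` and `save_article` are hypothetical and only illustrate the idea.

```python
import sqlite3

def init_db(path=":memory:"):
    """Open (or create) the database and make sure the articles table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS articles (
            url      TEXT PRIMARY KEY,
            title    TEXT,
            authors  TEXT,
            text     TEXT,
            summary  TEXT,
            category TEXT
        )
        """
    )
    return conn

def save_article(conn, article):
    """Insert one article dict; a URL that is already stored is replaced."""
    conn.execute(
        "INSERT OR REPLACE INTO articles (url, title, authors, text, summary, category) "
        "VALUES (:url, :title, :authors, :text, :summary, :category)",
        article,
    )
    conn.commit()

conn = init_db()  # in-memory database, for demonstration only
save_article(conn, {
    "url": "https://example.com/story-1",  # placeholder URL
    "title": "Example headline",
    "authors": "A. Writer",
    "text": "Full article text goes here.",
    "summary": "Short summary.",
    "category": "world",
})
rows = conn.execute("SELECT url, title, category FROM articles").fetchall()
print(rows)
```

Using the URL as the primary key means re-scraping the same story updates the row instead of duplicating it. SQLite cannot store a Python list directly, so a list of authors would need to be joined into one string (or moved to a separate table) before saving.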


Here's a list of free news websites that provide global news:

1. **BBC News** - [https://www.bbc.com/news](https://www.bbc.com/news)

2. **CNN** - [https://www.cnn.com/](https://www.cnn.com/)

3. **Al Jazeera** - [https://www.aljazeera.com/](https://www.aljazeera.com/)

4. **Reuters** - [https://www.reuters.com/](https://www.reuters.com/)

5. **NPR** - [https://www.npr.org/](https://www.npr.org/)

6. **The Guardian** - [https://www.theguardian.com/](https://www.theguardian.com/)

7. **The New York Times** - [https://www.nytimes.com/](https://www.nytimes.com/)

8. **BBC World Service** - [https://www.bbc.co.uk/worldserviceradio](https://www.bbc.co.uk/worldserviceradio)

9. **Bloomberg** - [https://www.bloomberg.com/](https://www.bloomberg.com/)

10. **AP News** - [https://apnews.com/](https://apnews.com/)

These websites provide free access to a wide range of news articles covering global events and topics. Please note that the availability of content and access may vary by region, and some websites may offer premium content or subscription options alongside their free offerings.

In [None]:
# Building the scraper

In [None]:
#!pip install BeautifulSoup4

In [2]:
# !pip install nltk
# !pip install newspaper3k



In [8]:
#!pip install pysummarization
#!pip install contractions
#!pip install unicodedata  # not needed: unicodedata is part of the Python standard library, which is why pip fails below
#!pip install regex

ERROR: Could not find a version that satisfies the requirement unicodedata (from versions: none)
ERROR: No matching distribution found for unicodedata


#### Getting the URLs

In [5]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

base_url = "https://www.bbc.com"
url = "https://www.bbc.co.uk/news"

response = requests.get(url, timeout=10)

# List of article URLs; stays empty if the request fails
cleaned_urls = []

# Check if the request is successful
if response.status_code == 200:
    # get content
    soup = bs(response.text, "html.parser")

    # find all anchor elements with the "gs-c-promo-heading" class
    article_links = soup.find_all("a", class_="gs-c-promo-heading")

    # Extract article URLs from the href attribute and add to the list
    for link in article_links:
        href = link.get("href")
        if not href:
            continue  # skip anchors without an href
        # urljoin prefixes relative paths with base_url and leaves absolute
        # URLs untouched, so no duplicate "https://www.bbc.com" cleanup is needed
        cleaned_urls.append(urljoin(base_url, href))

    # Print the list of article URLs
    for article_url in cleaned_urls:
        print(article_url)

else:
    print("Failed to retrieve the page.")


https://www.bbc.com/news/live/world-middle-east-67831997
https://www.bbc.com/news/world-middle-east-67831478
https://www.bbc.com/news/world-europe-67827443
https://www.bbc.com/news/world-us-canada-67833339
https://www.bbc.com/news/entertainment-arts-67832513
https://www.bbc.com/news/science-environment-67718719
https://www.bbc.com/news/world-europe-67826487
https://www.bbc.com/news/world-us-canada-67830918
https://www.bbc.com/news/world-us-canada-67832595
https://www.bbc.com/news/world-us-canada-67835961
https://www.bbc.com/news/world-us-canada-67832651
https://www.bbc.com/news/world-us-canada-67835961
https://www.bbc.com/news/world-us-canada-67832651
https://www.bbc.com/news/world-latin-america-67826945
https://www.bbc.com/news/world-us-canada-67829682
https://www.bbc.com/news/world-us-canada-67834817
https://www.bbc.com/news/world-latin-america-66786995
https://www.bbc.com/news/world_radio_and_tv
https://www.bbc.com/sounds/play/live:bbc_world_service
https://www.bbc.com/news/world-as
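The printed list repeats some URLs (for example, the story ending in `67835961` appears twice), so each of those articles would be downloaded and summarized more than once. A minimal sketch of resolving links and dropping repeats while keeping the page order is shown below. The helper name `normalize_links` and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.bbc.com"

def normalize_links(hrefs, base_url=BASE_URL):
    """Resolve hrefs against base_url and drop repeats, keeping first-seen order."""
    absolute = [urljoin(base_url, href) for href in hrefs if href]
    # dict preserves insertion order, so fromkeys() de-duplicates without reordering
    return list(dict.fromkeys(absolute))

sample_hrefs = [
    "/news/world-europe-67827443",
    "https://www.bbc.com/news/world-us-canada-67835961",
    "/news/world-us-canada-67835961",  # same story as the line above
    None,                              # anchor without an href
]
print(normalize_links(sample_hrefs))
```

A `set` would also remove duplicates, but it does not keep the order in which stories appear on the page.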

#### Extracting Article Information

In [6]:
from newspaper import Article
from datetime import datetime

# Function to extract information from a URL
def extract_info(url):
    article = Article(url)
    article.download()
    article.parse()

    # Extracted information
    title = article.title
    text = article.text
    authors = article.authors if article.authors else ["Author not found"]

    # Generate summary
    article.nlp()
    summary = article.summary

    return {
        "URL": url,
        "Title": title,
        "Authors": authors,
        "Text": text,
        "TF_IDF_Summary": summary
    }

# Extract information for each cleaned URL in the list
results = [extract_info(url) for url in cleaned_urls]

# Create a Pandas DataFrame from the results
df = pd.DataFrame(results)

# Get the current date in the format YYYY-MM-DD
current_date = datetime.now().strftime("%Y-%m-%d")

# change the path to save the files for organization
directory_path = r"C:\Users\jessi\Desktop\Projects Personal\bbc_article_extracts"

# Combine the directory path with the CSV filename
csv_filename = f"{directory_path}\\bbc_articles_info_{current_date}.csv"

# Export the DataFrame to a CSV file with the updated path
df.to_csv(csv_filename, index=False)
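Some of the collected links (a live page, a radio stream) may not be regular articles. If `extract_info` raises for one of them, the list comprehension stops and every result gathered so far is lost. A minimal sketch of a more tolerant loop follows. The helper `extract_all` and the stand-in `fake_extract` are hypothetical; in the notebook, `extract_info` would be passed in instead.

```python
def extract_all(urls, extract_fn):
    """Apply extract_fn to each URL, collecting failures instead of stopping."""
    extracted, failures = [], []
    for url in urls:
        try:
            extracted.append(extract_fn(url))
        except Exception as exc:  # broad on purpose: download and parse errors vary
            failures.append((url, str(exc)))
    return extracted, failures

# Stand-in extractor for demonstration only
def fake_extract(url):
    if "/live/" in url:
        raise ValueError("no article body")
    return {"URL": url, "Title": "stub"}

extracted, failures = extract_all(
    ["https://www.bbc.com/news/a", "https://www.bbc.com/news/live/b"],
    fake_extract,
)
print(len(extracted), failures)
```

The failures list can be inspected or logged afterwards, and the DataFrame is then built only from the URLs that parsed.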

In [None]:
# Note: the summaries created by the newspaper package use the tf-idf method for summarization.
# Next step: try a BERT model for the summaries.

# For this, look at the Hugging Face Transformers library.
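To make the idea of a tf-idf based extractive summary concrete, here is a small sketch using only the standard library. It is not the newspaper package's implementation: it simply treats each sentence as a document, scores it by its tf-idf weights, and keeps the top sentences. The function name `tfidf_summary` and the naive sentence splitter are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tfidf_summary(text, num_sentences=2):
    """Return the num_sentences highest-scoring sentences, kept in original order."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    n = len(sentences)

    # Document frequency: number of sentences that contain each word
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

    def score(tokens):
        if not tokens:
            return 0.0
        tf = Counter(tokens)
        # Sum of tf * idf, with tf normalized by sentence length and a smoothed idf
        return sum(
            (count / len(tokens)) * (math.log((1 + n) / (1 + doc_freq[word])) + 1)
            for word, count in tf.items()
        )

    top = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]

sample_text = (
    "The storm hit the coast on Monday. "
    "Thousands of homes lost power. "
    "Officials said repairs could take days."
)
print(tfidf_summary(sample_text, num_sentences=2))
```

An extractive method like this can only reuse sentences that already exist in the article, whereas a sequence-to-sequence model can generate new wording.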

In [15]:
#USE THIS TO CHECK DF IN EXCEL
# Get the current date in the format YYYY-MM-DD
current_date = datetime.now().strftime("%Y-%m-%d")

# Export the DataFrame to a CSV file with the date in the name
csv_filename = f"bbc_articles_info_testing_{current_date}.csv"
df.to_csv(csv_filename, index=False)

#### Cleaning the Data/Preprocessing

In [3]:
# Expanding contractions

import contractions
import unicodedata
import re

# Function to expand contractions in a given text
def expand_contractions(text):
    return contractions.fix(text)

# Apply the function to the 'Text' column and create a new column for cleaned text
df['clean_article_text'] = df['Text'].apply(expand_contractions)


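# Removing accents: NFD normalization splits an accented character into a base
# letter plus combining marks (category 'Mn'), and the marks are then dropped.
# Other non-ASCII characters are handled by the special-character step below.
def remove_unicode(text):
    return ''.join(char for char in unicodedata.normalize('NFD', text) if unicodedata.category(char) != 'Mn')

# Apply this to the column we just made (will be keeping clean text/updates in one column)
df['clean_article_text'] = df['clean_article_text'].apply(remove_unicode)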


# Converting to lower case
df['clean_article_text'] = df['clean_article_text'].str.lower()

# Removing special characters and punctuation
def remove_special_characters(text):
    # Using regex to remove non-alphanumeric characters
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

# Apply the function to the 'clean_article_text' column and update it
df['clean_article_text'] = df['clean_article_text'].apply(remove_special_characters)
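To see what the accent-stripping and special-character steps do to a single string, here is a short standalone sketch. It repeats the same two transformations under the hypothetical names `strip_accents` and `keep_alphanumeric`, and the sample sentence is made up.

```python
import re
import unicodedata

def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks (category 'Mn')."""
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    )

def keep_alphanumeric(text):
    """Remove everything except ASCII letters, digits and whitespace."""
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)

sample = "Café owners said prices rose 5%, déjà vu!"
cleaned = keep_alphanumeric(strip_accents(sample).lower())
print(cleaned)  # cafe owners said prices rose 5 deja vu
```

Note that removing punctuation also removes sentence boundaries. That is fine for bag-of-words features, but a summarizer that works on sentences will probably do better with the original `Text` column.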


In [4]:
#### testing this
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download punkt tokenizer
nltk.download('punkt')

# Download NLTK stop words data
nltk.download('stopwords')

# Build the stop word set once instead of once per article
stop_words = set(stopwords.words('english'))

# Function to tokenize a text and remove stop words
def preprocess_text(text):
    # Accept either a string or a list of tokens. Calling ' '.join() on a plain
    # string would put a space between every character.
    if not isinstance(text, str):
        text = ' '.join(text)
    tokens = word_tokenize(text)
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    return ' '.join(filtered_tokens)

# Apply preprocessing to the 'clean_article_text' column
df['clean_article_text'] = df['clean_article_text'].apply(preprocess_text)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jessi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jessi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
# using NLTK to do further cleaning
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


# Tokenization

# Download punkt tokenizer
nltk.download('punkt')

# Apply tokenization to column in df (each article becomes a list of tokens)
df['clean_article_text'] = df['clean_article_text'].apply(word_tokenize)

# Stop word removal

# Download NLTK stop words data
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

# Function to remove stop words from an already tokenized article.
# Tokenizing the token list a second time is what raised
# "TypeError: expected string or bytes-like object, got 'list'".
def remove_stopwords(tokens):
    filtered_words = [word for word in tokens if word.lower() not in stop_words]
    return ' '.join(filtered_words)

# Apply stop words removal to column in df
df['clean_article_text'] = df['clean_article_text'].apply(remove_stopwords)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jessi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jessi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

In [None]:
#### Using Pysummarization
# refer to the documentation online

In [None]:
#### Using BERT via huggingface transformers (or use T5) check the link
#  https://keras.io/examples/nlp/t5_hf_summarization/       this uses T5

# https://datagraphi.com/blog/post/2021/9/24/comparing-performance-of-a-modern-nlp-framework-bert-vs-a-classical-approach-tf-idf-for-document-classification-with-simple-and-easy-to-understand-code
# ^ this uses BERT

In [None]:
### for bert

!pip install transformers torch datasets

In [None]:
from datasets import load_dataset
import torch
from transformers import (
    BertTokenizer,
    EncoderDecoderModel,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    default_data_collator,
)

dataset = load_dataset("cnn_dailymail", "3.0.0")
train_dataset = dataset['train']
test_dataset = dataset['test']

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_batch(batch):
    # Articles are the model input; the reference summaries ("highlights") become the labels
    inputs = tokenizer(batch['article'], padding='max_length', truncation=True, max_length=512)
    targets = tokenizer(batch['highlights'], padding='max_length', truncation=True, max_length=128)
    # Replace padding ids with -100 so the loss ignores them
    inputs['labels'] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in ids]
        for ids in targets['input_ids']
    ]
    return inputs

# Drop the raw text columns so only tensors-to-be reach the collator
columns_to_drop = ['article', 'highlights', 'id']
train_dataset = train_dataset.map(tokenize_batch, batched=True, remove_columns=columns_to_drop)
test_dataset = test_dataset.map(tokenize_batch, batched=True, remove_columns=columns_to_drop)

# BERT is encoder-only and transformers has no BertForSeq2Seq class, so this pairs
# two BERT checkpoints as encoder and decoder (a "BERT2BERT" model).
model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased')
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# The Trainer creates its own AdamW optimizer and learning-rate schedule from
# learning_rate below, so none is built by hand here.

def compute_metrics(pred):
    # Define your evaluation metric (e.g., BLEU, ROUGE, etc.)
    # For this example, we'll use dummy metrics
    return {"accuracy": 0.5}

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    evaluation_strategy="steps",  # named eval_strategy in newer transformers releases
    save_steps=500,
    eval_steps=500,
    logging_steps=500,
    logging_dir='./logs',
    do_train=True,
    do_eval=True,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    num_train_epochs=3,
    learning_rate=1e-5,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    # Every example is padded to a fixed length above, so the default collator can stack them
    data_collator=default_data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

results = trainer.evaluate()
print(results)
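The `compute_metrics` placeholder above returns a fixed value, so it cannot tell a good checkpoint from a bad one. As a rough sketch of the kind of metric it mentions, the function below computes a simplified ROUGE-1 F1 score (unigram overlap between a reference summary and a generated one). The name `rouge1_f1` is hypothetical, and this is not the official ROUGE implementation: it does no stemming and splits on whitespace only.

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F1: unigram overlap between reference and candidate."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))  # 0.8333
```

To use a score like this during training, the predicted token ids and the label ids would first have to be decoded back to text with the tokenizer, which is not shown here.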


In [None]:
# ## bert -> probably drop this part; the cell above uses the CNN dataset to train and evaluate the model.
# from transformers import BertTokenizer, BertForSequenceClassification, AdamW
# import torch

# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# model = BertForConditionalGeneration.from_pretrained('bert-base-uncased')

# # Extract cleaned text data from the dataframe
# texts = df['cleaned_article_text'].tolist()

# # Tokenize the texts
# inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# # Generate summaries
# outputs = model.generate(inputs.input_ids, max_length=150, num_beams=2, early_stopping=True)

# # Decode the generated summaries
# generated_summaries = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]

# # store the summaries
# df['Bert_generated_summary'] = generated_summaries


In [None]:
## bert after trained/tuned on my own data
# Tokenizing the dataframe
def tokenize_articles(text):
    inputs = tokenizer(text, max_length=512, padding="max_length", truncation=True, return_tensors="pt")
    return inputs

# The preprocessing cells store the cleaned text in 'clean_article_text'
df['tokenized'] = df['clean_article_text'].apply(tokenize_articles)

def generate_summary(input_ids, attention_mask):
    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)
    
    # Generate summary
    with torch.no_grad():
        outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=150, num_beams=4, length_penalty=2.0, early_stopping=True)
    
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary

df['generated_summary'] = df['tokenized'].apply(lambda x: generate_summary(x['input_ids'], x['attention_mask']))
