## goal:

Create a web application that pulls news stories from specific sites, summarizes each article, assigns it to one of a set of predefined categories, and presents the summary and category to the user on screen.

Need to do:

1. **Web Scraping:**
   Use a Python library like Beautiful Soup and Requests or a specialized library like Scrapy to scrape news articles from specific websites. You'll need to identify the HTML structure of the articles on these sites and extract relevant information, such as the article content, title, and category.

2. **Text Summarization:**
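   Implement text summarization using a library like NLTK or Hugging Face Transformers (for pretrained sequence-to-sequence models such as T5 or BART). Gensim is only an option on older releases, since its summarization module was removed in version 4.0. The summarization process should take the full article text and produce a shorter summary.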

3. **Text Classification:**
   For categorization, you'll need a machine learning model for text classification. You can train your own model using a dataset of categorized articles, or you can use pre-trained models like those from the Hugging Face Transformers library. The model will take the article text and categorize it into specific categories.

4. **Web Framework:**
   Choose a web framework for building your web application. Flask and Django are popular options in Python. Set up your application with routes for user interaction and displaying results.

5. **Database (Optional):**
   You can store scraped articles, summaries, and categories in a database for better management and retrieval.

6. **User Interface:**
   Design and develop the user interface to present the summarized articles and their categories to users. You can use HTML, CSS, and JavaScript for the front end. Libraries like Bootstrap can help with styling.

7. **Integration:**
   Integrate the web scraping, text summarization, and text classification components into your web application. When a user requests news, the application should scrape articles, summarize them, and classify them in real-time.

8. **User Interaction:**
   Implement user interaction features to allow users to request news from specific sites, view article summaries, and see categorized results.

9. **Deployment:**
   Deploy your web application on a server. You can use platforms like Heroku, AWS, or a VPS to host your application and make it accessible on the web.

10. **Testing and Maintenance:**
    Thoroughly test your web application to ensure it works as expected. Monitor for issues and maintain the application over time.

Based on this plan, this notebook should cover the web scraping, storing the scraped data in a database, and the model that generates text summaries. The Django part will be completed later.


Here's a list of free news websites that provide global news:

1. **BBC News** - [https://www.bbc.com/news](https://www.bbc.com/news)

2. **CNN** - [https://www.cnn.com/](https://www.cnn.com/)

3. **Al Jazeera** - [https://www.aljazeera.com/](https://www.aljazeera.com/)

4. **Reuters** - [https://www.reuters.com/](https://www.reuters.com/)

5. **NPR** - [https://www.npr.org/](https://www.npr.org/)

6. **The Guardian** - [https://www.theguardian.com/](https://www.theguardian.com/)

7. **The New York Times** - [https://www.nytimes.com/](https://www.nytimes.com/)

8. **BBC World Service** - [https://www.bbc.co.uk/worldserviceradio](https://www.bbc.co.uk/worldserviceradio)

9. **Bloomberg** - [https://www.bloomberg.com/](https://www.bloomberg.com/)

10. **AP News** - [https://apnews.com/](https://apnews.com/)

These websites provide free access to a wide range of news articles covering global events and topics. Please note that the availability of content and access may vary by region, and some websites may offer premium content or subscription options alongside their free offerings.

In [None]:
# Building the scraper

In [None]:
#!pip install BeautifulSoup4

In [None]:
# !pip install nltk
# !pip install newspaper3k

In [None]:
#!pip install torch transformers

In [None]:
#!pip install pysummarization
#!pip install contractions
# unicodedata is part of the Python standard library; no install needed
#!pip install regex
#!pip install datasets
# !pip install SentencePiece
# !pip install evaluate
# !pip install rouge_score
#!pip install accelerate -U
#!pip install tensorboard

#### Getting the URLs

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

url = "https://www.bbc.co.uk/news"

# Set a timeout so a stalled connection cannot hang the notebook
response = requests.get(url, timeout=10)

# Article URLs collected from the front page (stays empty if the request fails)
cleaned_urls = []

# Check if the request is successful
if response.status_code == 200:
    # get content
    soup = bs(response.text, "html.parser")

    # find all anchor elements with the "gs-c-promo-heading" class
    article_links = soup.find_all("a", class_="gs-c-promo-heading")

    for link in article_links:
        href = link.get("href")
        if not href:
            continue  # skip anchors without an href attribute
        # urljoin prefixes relative paths with the BBC domain and leaves
        # absolute URLs (e.g. https://www.bbc.co.uk/sport/...) unchanged
        cleaned_urls.append(urljoin("https://www.bbc.com", href))

    # Print the cleaned list of article URLs
    for article_url in cleaned_urls:
        print(article_url)

else:
    print("Failed to retrieve the page.")


https://www.bbc.com/news/world-us-canada-68269354
https://www.bbc.co.uk/sport/american-football/live/ceqj69d5y8yt
https://www.bbc.com/news/world-us-canada-68269413
https://www.bbc.com/news/entertainment-arts-68238272
https://www.bbc.com/news/world-us-canada-68270748
https://www.bbc.com/news/world-middle-east-68269957
https://www.bbc.com/news/world-africa-68270866
https://www.bbc.com/sport/football/68196261
https://www.bbc.com/news/world-us-canada-68268817
https://www.bbc.com/news/world-asia-68266845
https://www.bbc.com/news/world-asia-68262751
https://www.bbc.com/news/world-asia-68266845
https://www.bbc.com/news/world-asia-68262751
https://www.bbc.com/news/world-68266846
https://www.bbc.com/news/world-africa-68255614
https://www.bbc.com/news/world-latin-america-68268257
https://www.bbc.com/sport/american-football/68201059
https://www.bbc.com/sport/american-football/68204783
https://www.bbc.com/sport/american-football/68250146
https://www.bbc.com/sport/american-football/68204790
https:/
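
The printed list repeats some links (for example `world-asia-68266845` appears twice) and mixes sport pages in with news. The following is a minimal sketch of how the list could be deduplicated and a coarse section label read off the URL path. It assumes BBC news paths keep the `/news/<section>-<id>` shape seen in the output above; `dedupe_urls` and `section_from_url` are hypothetical helper names, not part of the notebook.

```python
import re
from urllib.parse import urlparse


def dedupe_urls(urls):
    """Drop repeated URLs while keeping first-seen order."""
    return list(dict.fromkeys(urls))


def section_from_url(url):
    """Return the section slug of a /news/<section>-<id> URL, else None.

    Assumes the path shape seen in the scraper output; other layouts give None.
    """
    path = urlparse(url).path
    match = re.fullmatch(r"/news/([a-z-]+)-(\d+)", path)
    return match.group(1) if match else None


sample = [
    "https://www.bbc.com/news/world-asia-68266845",
    "https://www.bbc.com/news/world-asia-68262751",
    "https://www.bbc.com/news/world-asia-68266845",
    "https://www.bbc.com/sport/football/68196261",
]

unique = dedupe_urls(sample)
sections = {u: section_from_url(u) for u in unique}
```

The section slug is only a rough hint of the topic and is no substitute for the text classifier described in the plan, but it could serve as a weak label while that model is still missing.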

#### Extracting Article Information

In [2]:
from datetime import datetime
from pathlib import Path

from newspaper import Article
from newspaper.article import ArticleException

# Function to extract information from a URL
def extract_info(url):
    article = Article(url)
    article.download()
    article.parse()

    # Extracted information
    title = article.title
    text = article.text
    authors = article.authors if article.authors else ["Author not found"]

    # Generate newspaper3k's built-in extractive summary
    # (nlp() needs NLTK's punkt tokenizer data to be installed)
    article.nlp()
    summary = article.summary

    return {
        "URL": url,
        "Title": title,
        "Authors": authors,
        "Text": text,
        "TF_IDF_Summary": summary
    }

# Extract information for each cleaned URL, skipping pages that fail to download or parse
results = []
for url in cleaned_urls:
    try:
        results.append(extract_info(url))
    except ArticleException as err:
        print(f"Skipping {url}: {err}")

# Create a Pandas DataFrame from the results
df = pd.DataFrame(results)

# Get the current date in the format YYYY-MM-DD
current_date = datetime.now().strftime("%Y-%m-%d")

# change the path to save the files for organization
directory_path = Path(r"C:\Users\jessi\Desktop\Projects Personal\bbc_article_extracts")

# Combine the directory path with the CSV filename
csv_filename = directory_path / f"bbc_articles_info_{current_date}.csv"

# Export the DataFrame to a CSV file with the updated path
df.to_csv(csv_filename, index=False)

In [None]:
#USE THIS TO CHECK DF IN EXCEL
# Get the current date in the format YYYY-MM-DD
current_date = datetime.now().strftime("%Y-%m-%d")

# Export the DataFrame to a CSV file with the date in the name
csv_filename = f"bbc_articles_info_testing_{current_date}.csv"
df.to_csv(csv_filename, index=False)
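
The plan at the top of this notebook calls for storing the scraped data in a database, while the cells above only write CSV files. The following is a minimal sketch of one way to do that with the standard-library `sqlite3` module. The table name, column names, and the `save_articles` helper are assumptions made for illustration; the dictionary keys match what `extract_info` returns.

```python
import sqlite3


def save_articles(conn, rows):
    """Insert article dicts into an `articles` table, skipping URLs already stored."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS articles (
            url     TEXT PRIMARY KEY,
            title   TEXT,
            authors TEXT,
            text    TEXT,
            summary TEXT
        )
        """
    )
    conn.executemany(
        "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?, ?)",
        [
            (
                row["URL"],
                row["Title"],
                ", ".join(row["Authors"]),  # the authors list is flattened to text
                row["Text"],
                row["TF_IDF_Summary"],
            )
            for row in rows
        ],
    )
    conn.commit()


# In-memory database for illustration; a file path would make it persistent
conn = sqlite3.connect(":memory:")
example_rows = [
    {
        "URL": "https://www.bbc.com/news/example-1",  # placeholder URL
        "Title": "Example title",
        "Authors": ["Author not found"],
        "Text": "Full article text.",
        "TF_IDF_Summary": "Short summary.",
    }
]
save_articles(conn, example_rows)
save_articles(conn, example_rows)  # second call is ignored: same primary key
stored = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
```

With the notebook's own data, the call would be `save_articles(conn, results)`. Using the URL as the primary key means re-running the scraper on the same day does not create duplicate rows.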

#### Cleaning the Data/Preprocessing

In [15]:
# Expanding contractions

import contractions
import unicodedata
import re

# Function to expand contractions in a given text
def expand_contractions(text):
    return contractions.fix(text)

# Apply the function to the 'Text' column and create a new column for cleaned text
df['clean_article_text'] = df['Text'].apply(expand_contractions)


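# Removing accents: NFD-normalize, then drop combining marks (category 'Mn').
# Other non-ASCII characters are removed later by the special-character regex.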
def remove_unicode(text):
    return ''.join(char for char in unicodedata.normalize('NFD', text) if unicodedata.category(char) != 'Mn')

# Apply this to the column we just made (will be keeping clean text/updates in one column)
df['clean_article_text'] = df['clean_article_text'].apply(remove_unicode)


# Converting to lower case
df['clean_article_text'] = df['clean_article_text'].str.lower()

# Removing special characters and punctuation
def remove_special_characters(text):
    # Using regex to remove non-alphanumeric characters
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

# Apply the function to the 'clean_article_text' column and update it
df['clean_article_text'] = df['clean_article_text'].apply(remove_special_characters)
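
To make the effect of these cleaning steps concrete, here is a small worked example on a single made-up sentence. It reproduces the accent stripping, lower-casing, and special-character removal with the standard library only; the contraction expansion is left out because it depends on the third-party `contractions` package. The name `strip_accents` and the sample sentence are illustrative.

```python
import re
import unicodedata


def strip_accents(text):
    # NFD splits an accented letter into a base letter plus a combining mark
    # (category "Mn"); dropping the marks leaves the plain letter
    return "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if unicodedata.category(ch) != "Mn"
    )


def remove_special_characters(text):
    # Keep only ASCII letters, digits, and whitespace
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)


sample = "Déjà vu: the café re-opened in 2024!"
step1 = strip_accents(sample)             # accents removed
step2 = step1.lower()                     # lower-cased
step3 = remove_special_characters(step2)  # punctuation and hyphen removed
```

Note that the last step also deletes sentence-ending punctuation and hyphens, so sentence boundaries are lost in `clean_article_text`. The T5 cell further down reads the raw `Text` column, so it is not affected by this.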


In [None]:
## up until here, the df looks okay. need to tokenize, remove stop words, then lemmatize. the below messes up the DF

In [None]:
# # using NLTK to do further cleaning
# import nltk
# from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize
# from nltk.stem import WordNetLemmatizer

# # Tokenization

# # Download punkt tokenizer
# nltk.download('punkt')

# # convert the column into a string (or else below doesn't work)
# df['clean_article_text'] = df['clean_article_text'].astype(str)

# # Apply tokenization to column in df
# #df['clean_article_text'] = df['clean_article_text'].apply(word_tokenize)

# # Stop word removal

# # Download NLTK stop words data
# nltk.download('stopwords')

# # Function to remove stop words
# def remove_stopwords(clean_article_text):
#     stop_words = set(stopwords.words('english'))
#     words = nltk.word_tokenize(clean_article_text)
#     filtered_words = [word for word in words if word.lower() not in stop_words]
#     return ' '.join(filtered_words)

# # convert column to string again (it changes after the above)
# df['clean_article_text']= df['clean_article_text'].apply(str)

# # Apply stop words removal to column in df
# df['clean_article_text'] = df['clean_article_text'].apply(remove_stopwords)

# # Lemmatization

# # Download WordNet lemmatizer
# nltk.download('wordnet')

# # Initialize the lemmatizer
# lemmatizer = WordNetLemmatizer()

# # Function to perform lemmatization
# def lemmatize_text(clean_article_text):
#     # tokenize first so that `words` is defined before lemmatizing
#     words = nltk.word_tokenize(clean_article_text)
#     lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
#     return ' '.join(lemmatized_words)

# # convert column to string again (it changes after the above)
# df['clean_article_text']= df['clean_article_text'].apply(str)

# # Apply lemmatization to column in df
# df['clean_article_text'] = df['clean_article_text'].apply(lemmatize_text)
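
One possible reason the commented-out cell above "messes up the DF": once a cell has been tokenized into a list, `astype(str)` or `apply(str)` stores the list's printed form, brackets and quotes included, instead of the original sentence. The sketch below shows the difference. It uses plain `str.split` and a tiny hand-written stop-word set as stand-ins for NLTK, so both are assumptions for illustration and not NLTK's behavior or data.

```python
# Stand-ins so the example runs without NLTK downloads (assumptions, not NLTK's data)
STOP_WORDS = {"the", "on", "a", "is"}


def tokenize(text):
    return text.split()


def remove_stopwords(text):
    # Takes a string and returns a string, so the column type never changes
    kept = [word for word in tokenize(text) if word not in STOP_WORDS]
    return " ".join(kept)


sentence = "the cat sat on the mat"

tokens = tokenize(sentence)
as_str = str(tokens)          # what .apply(str) does to a list cell
rejoined = " ".join(tokens)   # the round trip that preserves the text
filtered = remove_stopwords(sentence)
```

Keeping every step as a string-in, string-out function means the column can be passed from one `apply` to the next without any type conversions in between.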


In [None]:
#T5 below

In [3]:
import torch
import numpy as np
#from transformers import AutoTokenizer, AutoModelForSeq2SeqLM  # Updated import
from transformers import T5Tokenizer, T5ForConditionalGeneration

#tokenizer = AutoTokenizer.from_pretrained('t5-base')
tokenizer = T5Tokenizer.from_pretrained('t5-base')
#model = AutoModelForSeq2SeqLM.from_pretrained('t5-base', return_dict=True)  # Updated model initialization
model = T5ForConditionalGeneration.from_pretrained('t5-base')


#setting up text to be handled by T5

# Create an empty column for the T5 summaries and then save
df['T5_Summary'] = ''

# Loop through each article and generate summary
for index, row in df.iterrows():
    text = row['Text']
    
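    # The input is capped at 512 tokens, so longer articles are truncated
    inputs = tokenizer.encode("summarize: " + text, return_tensors='pt', max_length=512, truncation=True)
    output = model.generate(inputs, min_length=80, max_length=100)
    # skip_special_tokens drops markers such as <pad> and </s> from the decoded summary
    summary = tokenizer.decode(output[0], skip_special_tokens=True)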
    
    df.at[index, 'T5_Summary'] = summary


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/hugg
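
Because the input to T5 is capped at 512 tokens, the summary of a long article only reflects its opening paragraphs. One possible workaround, sketched here under assumptions, is to split the article into chunks, summarize each chunk, and join the partial summaries. `summarize_fn` is a hypothetical callable standing in for the T5 call, the chunk size is counted in words (which only approximates the token count), and the stub summarizer exists purely so the example runs on its own.

```python
def chunk_words(text, max_words=350):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def summarize_long(text, summarize_fn, max_words=350):
    """Summarize each chunk with summarize_fn and join the partial summaries."""
    return " ".join(summarize_fn(chunk) for chunk in chunk_words(text, max_words))


# Stub summarizer for illustration: keeps the first three words of each chunk
def first_words(chunk):
    return " ".join(chunk.split()[:3])


article = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(article, max_words=4)
combined = summarize_long(article, first_words, max_words=4)
```

Splitting on word counts can cut a sentence in half; splitting on sentence boundaries would likely read better but needs a sentence tokenizer.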

In [None]:
# import evaluate
# import sentencepiece

# from datasets import load_dataset

# bbc_data = load_dataset('gopalkalpande/bbc-news-summary', split='train')

# full_data = bbc_data.train_test_split(test_size=0.2,shuffle=True)
# train_data = full_data['train']
# valid_data = full_data['test']

# print(train_data)
# print(valid_data)



In [None]:
# from transformers import T5Tokenizer

# tokenizer = T5Tokenizer.from_pretrained('t5-base')


# def preprocess_function(examples):
#     inputs = [f"summarize: {article}" for article in examples['Articles']]
#     model_inputs = tokenizer(
#         inputs,
#         max_length=512,
#         truncation=True,
#         padding='max_length'
#     )
 
#     # Set up the tokenizer for targets
#     targets = [summary for summary in examples['Summaries']]
#     with tokenizer.as_target_tokenizer():
#         labels = tokenizer(
#             targets,
#             max_length=512,
#             truncation=True,
#             padding='max_length'
#         )
 
#     model_inputs["labels"] = labels["input_ids"]
#     return model_inputs

# # Apply the function to the whole dataset
# tokenized_train = train_data.map(
#     preprocess_function,
#     batched=True,
#     num_proc=3
# )
# tokenized_valid = valid_data.map(
#     preprocess_function,
#     batched=True,
#     num_proc=3
# )
 

In [None]:
# import torch
# from datasets import load_dataset
# from transformers import T5Tokenizer, T5ForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer

# # Load the dataset
# bbc_data = load_dataset('gopalkalpande/bbc-news-summary', split='train')

# # Split the dataset into train and test sets
# full_data = bbc_data.train_test_split(test_size=0.2, shuffle=True)
# train_data = full_data['train']
# test_data = full_data['test']

# # Initialize the tokenizer in the global scope
# tokenizer = T5Tokenizer.from_pretrained('t5-base')

# def preprocess_function(examples, tokenizer):
#     inputs = [f"summarize: {article}" for article in examples['Articles']]
#     model_inputs = tokenizer(
#         inputs,
#         max_length=512,
#         truncation=True,
#         padding='max_length'
#     )
 
#     # Set up the tokenizer for targets
#     targets = [summary for summary in examples['Summaries']]
#     with tokenizer.as_target_tokenizer():
#         labels = tokenizer(
#             targets,
#             max_length=512,
#             truncation=True,
#             padding='max_length'
#         )
 
#     model_inputs["labels"] = labels["input_ids"]
#     return model_inputs

# # Apply the function to the whole dataset
# tokenized_train = train_data.map(
#     preprocess_function,
#     batched=True,
#     num_proc=3,
#     fn_kwargs={'tokenizer': tokenizer}
# )
# tokenized_test = test_data.map(
#     preprocess_function,
#     batched=True,
#     num_proc=3,
#     fn_kwargs={'tokenizer': tokenizer}
# )

# # Fine-tune the model
# training_args = Seq2SeqTrainingArguments(
#     output_dir="./results",
#     per_device_train_batch_size=8,
#     per_device_eval_batch_size=8,
#     predict_with_generate=True,
#     evaluation_strategy="steps",
#     save_steps=500,
#     eval_steps=500,
#     logging_steps=500,
#     logging_dir='./logs',
#     do_train=True,
#     do_eval=True,
#     load_best_model_at_end=True,
#     metric_for_best_model="rouge",
#     greater_is_better=True,
#     num_train_epochs=3,
#     learning_rate=1e-5,
# )

# trainer = Seq2SeqTrainer(
#     model=model,
#     args=training_args,
#     data_collator = lambda data: {
#     'input_ids': torch.stack([item['input_ids'] for item in data]),
#     'attention_mask': torch.stack([item['attention_mask'] for item in data]),
#     'labels': torch.stack([item['labels'] for item in data])},
#     train_dataset=tokenized_train,
#     eval_dataset=tokenized_test,
#     tokenizer=tokenizer,
# )


# # Train the model
# trainer.train()

# # Evaluate the model using ROUGE score
# rouge = Rouge()
# eval_results = trainer.predict(tokenized_test)
# references = [" ".join(ex['Summaries']) for ex in test_data]
# predictions = [" ".join(rouge_output[0]["summary_text"].split()) for rouge_output in eval_results.predictions]
# rouge_scores = rouge.get_scores(predictions, references, avg=True)

# print("ROUGE Scores:", rouge_scores)

# # Use the fine-tuned model to create summaries for your own data
# # Assuming df is your DataFrame with the 'Text' column
# df['FineTuned_Summary'] = ''

# for index, row in df.iterrows():
#     text = row['Text']
    
#     inputs = tokenizer.encode("summarize: " + text, return_tensors='pt', max_length=512, truncation=True)
#     output = model.generate(inputs, min_length=80, max_length=100)
#     summary = tokenizer.decode(output[0])
    
#     df.at[index, 'FineTuned_Summary'] = summary


In [4]:
# import torch
from datasets import load_dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer

# Load the dataset
bbc_data = load_dataset('gopalkalpande/bbc-news-summary', split='train')

# Split the dataset into train and test sets
full_data = bbc_data.train_test_split(test_size=0.2, shuffle=True)
train_data = full_data['train']
test_data = full_data['test']

# Initialize the tokenizer in the global scope
tokenizer = T5Tokenizer.from_pretrained('t5-base')

def preprocess_function(examples, tokenizer):
    inputs = [f"summarize: {article}" for article in examples['Articles']]
    model_inputs = tokenizer(
        inputs,
        max_length=512,
        truncation=True,
        padding='max_length'
    )
 
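    # Tokenize the reference summaries as targets
    # (text_target replaces the deprecated as_target_tokenizer() context manager)
    labels = tokenizer(
        text_target=examples['Summaries'],
        max_length=512,
        truncation=True,
        padding='max_length'
    )

    # Mask padding with -100 so it is ignored by the loss;
    # compute_metrics maps -100 back to the pad token before decoding
    model_inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]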
    return model_inputs

# Apply the function to the whole dataset
tokenized_train = train_data.map(
    preprocess_function,
    batched=True,
    num_proc=3,
    fn_kwargs={'tokenizer': tokenizer}
)
tokenized_test = test_data.map(
    preprocess_function,
    batched=True,
    num_proc=3,
    fn_kwargs={'tokenizer': tokenizer}
)

model = T5ForConditionalGeneration.from_pretrained('t5-base')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Total parameters and trainable parameters.
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,} total parameters.")
total_trainable_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad)
print(f"{total_trainable_params:,} training parameters.")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Map (num_proc=3): 100%|█████████████████████████████████████████████████████| 1779/1779 [00:35<00:00, 50.15 examples/s]
Map (num_proc=3): 100%|███████████████████████████████████████████████████████| 445/445 [00:25<00:00, 17.58 examples/s]


222,903,552 total parameters.
222,903,552 training parameters.


In [5]:
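import evaluate

# datasets.load_metric is deprecated; evaluate.load is its replacement and
# returns plain float scores, which the round() call below relies on
rouge = evaluate.load("rouge")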
 
def compute_metrics(eval_pred):
    predictions, labels = eval_pred.predictions[0], eval_pred.label_ids
 
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
 
    result = rouge.compute(
        predictions=decoded_preds,
        references=decoded_labels,
        use_stemmer=True,
        rouge_types=[
            'rouge1',
            'rouge2',
            'rougeL'
        ]
    )
 
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
 
    return {k: round(v, 4) for k, v in result.items()}

def preprocess_logits_for_metrics(logits, labels):
    """
    Original Trainer may have a memory leak.
    This is a workaround to avoid storing too many tensors that are not needed.
    """
    pred_ids = torch.argmax(logits[0], dim=-1)
    return pred_ids, labels

  rouge = load_metric("rouge")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
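
For intuition about what the `rouge1` score in `compute_metrics` measures, here is a simplified stand-alone version: the F1 of unigram overlap between a predicted and a reference summary. It is a sketch only. It tokenizes on whitespace and skips stemming, so its numbers will not match the library metric exactly, and `rouge1_f1` is a name made up for this example.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 on whitespace tokens (no stemming)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection clips each word's count to the smaller of the two
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat lay on the mat", "the cat sat on the mat")
```

Here five of the six words overlap in both directions, so precision, recall, and F1 are all 5/6. `rouge2` applies the same idea to word pairs, and `rougeL` uses the longest common subsequence instead of n-gram counts.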


In [7]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='results_t5base',
    num_train_epochs=3,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='results_t5base',
    logging_steps=10,
    evaluation_strategy='steps',
    eval_steps=200,
    save_strategy='epoch',
    save_total_limit=2,
    report_to='tensorboard',
    learning_rate=0.0001,
    dataloader_num_workers=2,
    fp16=True,
    gradient_accumulation_steps=6

)
 
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    compute_metrics=compute_metrics
)
 
history = trainer.train()

OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 7.06 GiB is allocated by PyTorch, and 45.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
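
A back-of-the-envelope estimate suggests why this run does not fit on a 4 GiB GPU. The sketch assumes fp32 weights and gradients plus two fp32 AdamW moment buffers per parameter (16 bytes per parameter in total) and ignores activations entirely. With mixed precision the master weights typically stay in fp32, so `fp16=True` is not expected to shrink these terms. The helper name and byte counts are assumptions for illustration; the parameter count and GPU capacity come from the outputs above.

```python
def training_memory_gib(n_params, bytes_weights=4, bytes_grads=4, bytes_optimizer=8):
    """Rough lower bound: weights + gradients + AdamW moments, activations excluded."""
    total_bytes = n_params * (bytes_weights + bytes_grads + bytes_optimizer)
    return total_bytes / 2**30


T5_BASE_PARAMS = 222_903_552  # from the parameter count printed earlier
GPU_CAPACITY_GIB = 4.0        # from the error message

estimate = training_memory_gib(T5_BASE_PARAMS)
headroom = GPU_CAPACITY_GIB - estimate
print(f"~{estimate:.2f} GiB before activations, {headroom:.2f} GiB headroom")
```

Under these assumptions the model state alone takes most of the card, leaving little room for activations at a sequence length of 512. Options that may reduce the footprint include a smaller checkpoint such as `t5-small`, a shorter `max_length` for the labels, `gradient_checkpointing=True`, or `optim="adafactor"` in `TrainingArguments`; none of these has been tried in this notebook.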

In [None]:
# from transformers import TrainingArguments, Trainer

# # Assuming you have defined 'model', 'tokenized_train', 'tokenized_test', and 'compute_metrics' somewhere in your code

# training_args = TrainingArguments(
#     output_dir='results_t5base',
#     num_train_epochs=3,
#     per_device_train_batch_size=2,  # Adjust according to available CPU memory
#     per_device_eval_batch_size=2,   # Adjust according to available CPU memory
#     warmup_steps=500,
#     weight_decay=0.01,
#     logging_dir='results_t5base',
#     logging_steps=10,
#     evaluation_strategy='steps',
#     eval_steps=200,
#     save_strategy='epoch',
#     save_total_limit=2,
#     report_to='tensorboard',
#     learning_rate=0.00005,  # Experiment with different learning rates
#     dataloader_num_workers=2,  # Adjust according to your system's capacity
#     gradient_accumulation_steps=1,
#     fp16=False,
# )

# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=tokenized_train,
#     eval_dataset=tokenized_test,
#     compute_metrics=compute_metrics
# )

# history = trainer.train()


In [None]:
# fine tune t5 on bbc summary dataset
# evaluate it, and then run again

In [None]:
#### Using Pysummarization
# refer to the documentation online

In [None]:
#### Using BERT via huggingface transformers (or use T5) check the link
#  https://keras.io/examples/nlp/t5_hf_summarization/       this uses T5

# https://datagraphi.com/blog/post/2021/9/24/comparing-performance-of-a-modern-nlp-framework-bert-vs-a-classical-approach-tf-idf-for-document-classification-with-simple-and-easy-to-understand-code
# ^ this uses BERT

In [None]:
# from datasets import load_dataset

# dataset = load_dataset("cnn_dailymail", "3.0.0")
# train_dataset = dataset['train']
# test_dataset = dataset['test']

# from transformers import BertTokenizer

# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# def tokenize_batch(batch):
#     return tokenizer(batch['article'], padding=True, truncation=True, max_length=512)
    
# train_dataset = train_dataset.map(tokenize_batch, batched=True)
# test_dataset = test_dataset.map(tokenize_batch, batched=True)

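# import torch
# # transformers has no BertForSeq2Seq class; pairing a BERT encoder with a BERT decoder is one alternative
# from transformers import EncoderDecoderModel

# model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased')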

# # Define optimizer and scheduler
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)

# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model.to(device)

# def compute_metrics(pred):
#     # Define your evaluation metric (e.g., BLEU, ROUGE, etc.)
#     # For this example, we'll use dummy metrics
#     return {"accuracy": 0.5}

# from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# training_args = Seq2SeqTrainingArguments(
#     output_dir="./results",
#     per_device_train_batch_size=2,
#     per_device_eval_batch_size=2,
#     predict_with_generate=True,
#     evaluation_strategy="steps",
#     save_steps=500,
#     eval_steps=500,
#     logging_steps=500,
#     logging_dir='./logs',
#     do_train=True,
#     do_eval=True,
#     load_best_model_at_end=True,
#     metric_for_best_model="accuracy",
#     greater_is_better=True,
#     num_train_epochs=3,
#     learning_rate=1e-5,
# )

# trainer = Seq2SeqTrainer(
#     model=model,
#     args=training_args,
#     data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]), 
#                                 'attention_mask': torch.stack([f[1] for f in data]), 
#                                 'labels': torch.stack([f[2] for f in data])},
#     train_dataset=train_dataset,
#     eval_dataset=test_dataset,
#     tokenizer=tokenizer,
#     compute_metrics=compute_metrics
# )

# trainer.train()

# results = trainer.evaluate()
# print(results)


In [None]:
# ## bert after trained/tuned on my own data
# # Tokenizing the dataframe
# def tokenize_articles(text):
#     inputs = tokenizer(text, max_length=512, padding="max_length", truncation=True, return_tensors="pt")
#     return inputs

# df['tokenized'] = df['clean_article_text'].apply(tokenize_articles)

# def generate_summary(input_ids, attention_mask):
#     input_ids = input_ids.to(device)
#     attention_mask = attention_mask.to(device)
    
#     # Generate summary
#     with torch.no_grad():
#         outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=150, num_beams=4, length_penalty=2.0, early_stopping=True)
    
#     summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
#     return summary

# df['generated_summary'] = df['tokenized'].apply(lambda x: generate_summary(x['input_ids'], x['attention_mask']))


In [1]:
# import torch

# # Check if GPU is available for PyTorch
# is_gpu_available = torch.cuda.is_available()
# print(f"GPU available for PyTorch: {is_gpu_available}")


GPU available for PyTorch: True
