**15min.lt scraper**
The scraper is divided into a few key parts and is designed to collect data from the news website 15min.lt. The main scraping tool is Newspaper3k.
## Table of contents
* 1. URL of news collection from 15min.lt
* 2. Data scraping from URLs
* 3. Data cleaning and preparation for fine-tuning

### URL of news collection from 15min.lt

The following code block collects article URLs from www.15min.lt and saves them to a CSV file. A single run usually yields 900-1000 URLs. When the run is repeated daily, about half of them are duplicates, so the number of new URLs is about 400-500 per day. Depending on the project, some articles will be unsuitable because they are videos or contain fewer than 150 words. Collecting enough data for a project can therefore take a couple of weeks.

In [None]:
import newspaper
import pandas as pd

news_site = newspaper.build("https://www.15min.lt/", language="lt")

print(f"Articles found: {len(news_site.articles)}")

url_list = []

for article in news_site.articles:
    url_list.append(article.url)

for url in url_list[:5]:
    print(url)

df_urls = pd.DataFrame(url_list, columns=["url"])
df_urls.to_csv("15min_straipsniu_url.csv", index=False, encoding='utf-8-sig')
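
Because daily runs overlap, it can help to merge each day's URLs into the already collected list and keep only the unseen ones. The snippet below is a minimal sketch of that idea in plain Python; the function names `merge_urls` and `count_new` and the example URLs are hypothetical and not part of the original notebook.

```python
def merge_urls(existing_urls, new_urls):
    """Return the existing URLs followed by unseen new ones, preserving order."""
    # dict.fromkeys keeps the first occurrence of each key and drops repeats
    return list(dict.fromkeys([*existing_urls, *new_urls]))


def count_new(existing_urls, new_urls):
    """Count how many of the new URLs were not collected before."""
    return len(set(new_urls) - set(existing_urls))


# Hypothetical URLs, for illustration only
day_1 = ["https://www.15min.lt/a", "https://www.15min.lt/b"]
day_2 = ["https://www.15min.lt/b", "https://www.15min.lt/c"]

all_urls = merge_urls(day_1, day_2)
new_count = count_new(day_1, day_2)
print(all_urls)
print(new_count)
```

The merged list could then be written back to the CSV file in place of the single-day list.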

### Data scraping from URLs

## Installing packages

* newspaper3k - for crawling articles from URLs
* lxml[html_clean] - to help clean HTML (adverts, tags or other unwanted elements)
* lt_core_news_md - medium size Lithuanian spaCy model

In [None]:
!pip install newspaper3k
!pip install "lxml[html_clean]"
!python -m spacy download lt_core_news_md

### Import modules and functions
* urlparse - used to analyze the URL structure and extract categories and subcategories from URLs (newspaper3k had trouble extracting subcategories and mixed them up with categories)
* Article - used to extract article data from the URL
* re - used for HTML cleaning
* requests - for scanning web page content
* nltk - needed for newspaper3k nlp() method for sentence division
* spacy - used for text lemmatization


In [None]:
from urllib.parse import urlparse
import re

import newspaper
from newspaper import Article
import pandas as pd
import requests
import nltk
import spacy

# Tokenizer data that newspaper3k's nlp() method needs for sentence splitting
nltk.download('punkt')

# Medium-size Lithuanian spaCy model used for lemmatization
spacy_model = spacy.load("lt_core_news_md")

### Stop words


In [None]:
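# Copy the Lithuanian stop-word list into newspaper3k's resources folder so it is available for language='lt'
# (the path below is specific to a Python 3.11 installation)
!mkdir -p /usr/local/lib/python3.11/dist-packages/newspaper/resources/text/
!cp stopwords-lt.txt /usr/local/lib/python3.11/dist-packages/newspaper/resources/text/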

### Data crawling
* The article text is cleaned in two ways. The mild mode normalizes the text while keeping all punctuation and is used for summarization. The aggressive mode removes punctuation and digits, lemmatizes the text, and is used for clustering.
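
To illustrate the difference between the two modes, here is a simplified sketch that uses only regular expressions. It leaves out the spaCy lemmatization and stop-word removal that the full pipeline applies, and the helper names `clean_mild` and `clean_aggressive` and the sample sentence are assumptions made for this example.

```python
import re


def clean_mild(text):
    """Strip tags, collapse whitespace, normalize quotes and dashes."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.replace("“", '"').replace("”", '"').replace("–", "-")


def clean_aggressive(text):
    """Lowercase and keep only letters; no lemmatization in this sketch."""
    text = clean_mild(text).lower()
    # Everything except Latin and Lithuanian letters becomes a space
    text = re.sub(r"[^a-ząčęėįšųūž ]+", " ", text)
    # Drop single-character leftovers such as stray abbreviations
    return " ".join(word for word in text.split() if len(word) > 1)


sample = "<p>Jis sakė: “Labas”   – ir išėjo 2025 m.</p>"
mild = clean_mild(sample)
aggressive = clean_aggressive(sample)
print(mild)        # Jis sakė: "Labas" - ir išėjo 2025 m.
print(aggressive)  # jis sakė labas ir išėjo
```

The mild output stays readable as a sentence, which suits summarization, while the aggressive output is a bag of normalized words better suited to clustering.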

In [None]:
class TextProcessor:
    def __init__(self, url, stopwords):
        self.url = url
        self.stopwords = set(stopwords)
        self.nlp = spacy.load("lt_core_news_md", disable=["parser", "ner"])

        self.article = self.get_text(url)
        self.raw_text = self.article.text

        self.cleaned_for_summary = self.clean_text(self.raw_text, mode="mild")
        self.cleaned_for_clustering = self.clean_text(self.raw_text, mode="aggressive")

        main_cat, sub_cat = self.extract_categories_from_url()

        self.article_data = {
            "URL": self.url,
            "Title": self.article.title,
            "Main_category": main_cat,
            "Sub_category": sub_cat,
            "Text": self.raw_text,
            "Keywords": ', '.join(self.article.keywords) if self.article.keywords else 'No information',
            "Newspaper_summary": self.article.summary if self.article.summary else 'No summary available',
            "Cleaned_for_summary": self.cleaned_for_summary,
            "Cleaned_for_clustering": self.cleaned_for_clustering,
        }

    def get_text(self, url):
        article = Article(url, language='lt')
        article.download()
        article.parse()
        article.nlp()
        return article

    # The keyword "naujiena" is used to locate the main category and subcategory.
    # There are three possible URL layouts:
    #   1. "naujiena" comes right after 15min.lt/: the category is the first
    #      segment after it and the subcategory is the second.
    #   2. "naujiena" sits between the category and the subcategory.
    #   3. "naujiena" does not appear at all.
    def extract_categories_from_url(self):
        try:
            parts = urlparse(self.url).path.strip("/").split("/")

            # Case 1: /naujiena/kategorija/subkategorija
            if parts[0] == "naujiena" and len(parts) > 2:
                return parts[1].lower(), parts[2].lower()

            # Case 2: /kategorija/naujiena/subkategorija
            if "naujiena" in parts:
                i = parts.index("naujiena")
                main = parts[i - 1] if i > 0 else "Unknown"
                sub = parts[i + 1] if i + 1 < len(parts) else "Unknown"
                return main.lower(), sub.lower()

            # Case 3: /kategorija/subkategorija
            if len(parts) >= 2:
                return parts[0].lower(), parts[1].lower()

            return "Unknown", "Unknown"

        except Exception:
            return "Unknown", "Unknown"


    def clean_text(self, text, mode="mild"):
        # 1. Remove HTML and other unwanted elements
        clean = re.sub(r'<[^>]+>', ' ', text)            # strip HTML/XML tags
        clean = re.sub(r'\s+', ' ', clean).strip()       # collapse whitespace runs and trim the ends

        if mode == "mild":
            # Mild text cleaning - normalizing quotes (including the Lithuanian opening quote) and dashes
            clean = clean.replace("„", "\"").replace("“", "\"").replace("”", "\"")
            clean = clean.replace("–", "-")
            return clean

        elif mode == "aggressive":
            # Lowercase, remove punctuation and digits, then lemmatize
            clean = clean.lower()
            clean = re.sub(r'[^a-ząčęėįšųūž ]+', ' ', clean)  # keep only letters (including Lithuanian-specific ones) and spaces
            clean = re.sub(r'\s+', ' ', clean).strip()
            # Lemmatizing text using spaCy
            doc = self.nlp(clean)
            lemmas = []
            for token in doc:
                lemma = token.lemma_
                if lemma and lemma not in self.stopwords and len(lemma) > 1:
                    lemmas.append(lemma)
            return " ".join(lemmas)

        else:
            raise ValueError(f"Unknown text cleaning mode: {mode}. Please select 'mild' or 'aggressive'.")
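
The three URL layouts handled by `extract_categories_from_url` can be tried out in isolation. The sketch below repeats the same rules as a standalone function; the name `categories_from_url` and the example URLs are hypothetical and only illustrate the path shapes.

```python
from urllib.parse import urlparse


def categories_from_url(url):
    """Return (main_category, sub_category) using the 'naujiena' keyword rules."""
    parts = urlparse(url).path.strip("/").split("/")

    # Layout 1: /naujiena/<category>/<subcategory>/...
    if parts[0] == "naujiena" and len(parts) > 2:
        return parts[1].lower(), parts[2].lower()

    # Layout 2: /<category>/naujiena/<subcategory>/...
    if "naujiena" in parts:
        i = parts.index("naujiena")
        main = parts[i - 1] if i > 0 else "Unknown"
        sub = parts[i + 1] if i + 1 < len(parts) else "Unknown"
        return main.lower(), sub.lower()

    # Layout 3: /<category>/<subcategory>/...
    if len(parts) >= 2:
        return parts[0].lower(), parts[1].lower()

    return "Unknown", "Unknown"


# Hypothetical URLs covering each layout plus the fallback
layout_1 = categories_from_url("https://www.15min.lt/naujiena/aktualu/lietuva/pavyzdys-56-1")
layout_2 = categories_from_url("https://www.15min.lt/verslas/naujiena/finansai/pavyzdys-662-2")
layout_3 = categories_from_url("https://www.15min.lt/sportas/krepsinis/pavyzdys")
fallback = categories_from_url("https://www.15min.lt/")
print(layout_1, layout_2, layout_3, fallback)
```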

In [None]:
path = "15min_straipsniu_url.csv"

def load_lt_stopwords(filepath="stopwords-lt.txt"):
    with open(filepath, "r", encoding="utf-8") as f:
        stopwords = {line.strip() for line in f if line.strip()}
    return stopwords


def read_urls_from_csv(path):
    # The URL file was saved with a "url" header row, so let pandas read it as the header
    df = pd.read_csv(path, encoding="utf-8-sig")
    unique_urls = df["url"].drop_duplicates().tolist()

    return unique_urls

def process_articles(url_list):
    processed_data = []

    for url in url_list:
        try:
            tp = TextProcessor(url, stopwords=lt_stopwords)
            processed_data.append(tp.article_data)
        except Exception as e:
            # Skip articles that fail to download or parse and keep going
            print(f"There is a problem with: {url} | Error: {e}")
            continue

    return processed_data

def save_to_csv(data_list, filename="apdoroti_straipsniai.csv"):
    df = pd.DataFrame(data_list)
    df.to_csv(filename, index=False, encoding = "utf-8-sig")
    print(f"Data was saved successfully to the file: {filename}")

### Using data crawler

In [None]:
lt_stopwords = load_lt_stopwords("stopwords-lt.txt")

# Change the file name to match the URL file you want to process
urls = read_urls_from_csv("15min_straipsniu_url.csv")

article_data_list = process_articles(urls)

save_to_csv(article_data_list, filename="apdoroti_straipsniai_1.csv")

### Data cleaning and preparation for fine-tuning
In this part all CSV files are combined into one, and duplicates and articles with fewer than 150 words are removed (because the data is used for summarization training).

## Import modules and functions
* glob - to find all the CSVs
* os - to work with directories and folders

In [None]:
import pandas as pd
import numpy as np
import glob
import os

## Data preparation

In [None]:
def load_and_combine_csv_files(folder_path):
    all_files = glob.glob(os.path.join(folder_path, "*.csv"))
    dfs = [pd.read_csv(file, encoding="utf-8-sig") for file in all_files]
    combined_df = pd.concat(dfs, ignore_index=True)
    print(f"Total {len(dfs)} files, article number: {len(combined_df)}.")
    return combined_df

def remove_duplicates(df):
    before = len(df)
    df = df.drop_duplicates(subset=["Title", "Text"], keep="first")
    after = len(df)
    print(f"Removed {before - after} duplicates. Total {after} unique articles left.")
    return df

def filter_short_articles(df, min_word_count=150):
    # Count whitespace-separated words in the mildly cleaned text
    word_counts = df["Cleaned_for_summary"].apply(lambda x: len(str(x).split()))
    before = len(df)
    # Keep articles with at least min_word_count words
    df = df[word_counts >= min_word_count]
    after = len(df)
    print(f"Removed {before - after} articles that were too short. {after} articles left.")
    return df

def mega_pipeline(folder_path, output_filename="sujungti_ir_apdoroti.csv", min_word_count=150):
    print("Cleaning data")

    df = load_and_combine_csv_files(folder_path)
    df = remove_duplicates(df)
    df = filter_short_articles(df, min_word_count=min_word_count)

    df.to_csv(output_filename, index=False, encoding="utf-8-sig")
    print(f"Files saved: {output_filename}")

    print("\nFinished")

source_folder = "/content"   # folder containing the CSV files
output_filename = "/content/sujungti_ir_apdoroti.csv"  # where to save the cleaned data

mega_pipeline(source_folder, output_filename)
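
The two cleaning steps can be checked on a toy table before running them on the real data. This is a small sketch under assumed values: the three-column DataFrame and the threshold of 3 words are invented for illustration, while the column names match the ones used above.

```python
import pandas as pd

# Toy data: the second row duplicates the first, the third is too short
df_toy = pd.DataFrame({
    "Title": ["A", "A", "B", "C"],
    "Text": ["same text", "same text", "other", "third"],
    "Cleaned_for_summary": [
        "one two three",
        "one two three",
        "one",
        "one two three four",
    ],
})

# Step 1: drop rows whose title and text both repeat an earlier row
unique = df_toy.drop_duplicates(subset=["Title", "Text"], keep="first")

# Step 2: keep rows with at least 3 words (150 in the real pipeline)
word_counts = unique["Cleaned_for_summary"].astype(str).str.split().str.len()
kept = unique[word_counts >= 3].reset_index(drop=True)

kept_titles = kept["Title"].tolist()
print(len(df_toy), len(unique), len(kept))
print(kept_titles)
```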

## Category mapping
* because some categories contained only a small number of articles, they were merged into larger ones

In [1]:
path = "sujungti_ir_apdoroti.csv"
df = pd.read_csv(path)
print(f"Total number of articles: {len(df)}")

category_mapping = {
    '24sek': 'sportas',
    'gazas': 'verslas',
    'lengvai': 'gyvenimas',
    'media-pasakojimai': 'gyvenimas',
    'multimedija': 'gyvenimas',
    'video': 'gyvenimas',
    'prenumerata': 'unknown'
}

# Category counts before and after the mapping

print("\n Categories before mapping:")
print(df['Main_category'].value_counts(dropna=False))

df['Main_category'] = df['Main_category'].replace(category_mapping)

print("\n Categories after mapping:")
print(df['Main_category'].value_counts(dropna=False))

output_path = "sujungti_ir_apdoroti.csv"
df.to_csv(output_path, index=False, encoding="utf-8-sig")
print(f"\nData saved: {output_path}")


### GPT-4o mini summaries
* for mBART fine-tuning, GPT-4o mini was used to generate two summaries for each article
* the prompt to generate summaries was "Pateik labai trumpą ir aiškią santrauką lietuvių kalba (1–2 sakiniai) šiam tekstui" ("Provide a very short and clear summary in Lithuanian (1-2 sentences) of this text")

## Import modules
* openai - to send requests to OpenAI to generate text
* time - to pause between requests so as not to overload the OpenAI API

In [None]:
import openai
import pandas as pd
import time

In [None]:
# Your own OpenAI API key (the value is elided here; do not commit a real key)
openai.api_key = "sk-proj"

df = pd.read_csv('sujungti_ir_apdoroti.csv')

# create columns for GPT summaries
df['GPT4o_summary_1'] = ""
df['GPT4o_summary_2'] = ""

# GPT4o summary generation function
def get_gpt4o_summary(text, temperature=0.7):
    prompt = f"Pateik labai trumpą ir aiškią santrauką lietuvių kalba (1–2 sakiniai) šiam tekstui:\n\n{text}\n\nSantrauka:"

    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=100
    )

    return response.choices[0].message.content.strip()

for index, row in df.iterrows():
    print(f"Generating summaries {index+1}/{len(df)}...")

    text = row["Cleaned_for_summary"]

    summary_1 = get_gpt4o_summary(text, temperature=0.7)
    df.at[index, 'GPT4o_summary_1'] = summary_1
    time.sleep(1)

    summary_2 = get_gpt4o_summary(text, temperature=0.9)
    df.at[index, 'GPT4o_summary_2'] = summary_2
    time.sleep(1)

df.to_csv("straipsniai_su_gpt4o_santraukomis.csv", index=False, encoding='utf-8-sig')

print("Summaries are generated")
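
A fixed one-second pause does not protect a long run from a temporary failure such as a rate limit or a network error. One common pattern is to retry a failed call with an increasing delay. The sketch below shows the idea with a generic wrapper; `call_with_retry` and `flaky_summary` are hypothetical names, and the stand-in function fails twice on purpose instead of calling any real API.

```python
import time


def call_with_retry(func, *args, max_attempts=3, base_delay=1.0,
                    sleep=time.sleep, **kwargs):
    """Call func, retrying with exponential backoff when it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay}s")
            sleep(delay)


# Stand-in for an API call: fails twice, then succeeds
calls = {"n": 0}

def flaky_summary(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return f"summary of: {text}"


# Record the delays instead of actually sleeping, to keep the demo fast
delays = []
result = call_with_retry(flaky_summary, "text", sleep=delays.append)
print(result, delays)
```

In the loop above, a call such as `get_gpt4o_summary(text, temperature=0.7)` could be wrapped as `call_with_retry(get_gpt4o_summary, text, temperature=0.7)`.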

### Preparing the dataset for mBART fine-tuning
Up to this point the data has been prepared and cleaned. The following part prepares it for mBART fine-tuning: a new dataset is formed from the mildly cleaned article text and the GPT4o summaries. Every article is duplicated so that it is paired with each of its two summaries separately.

In [None]:
df = pd.read_csv("/content/straipsniai_su_gpt4o_santraukomis.csv")

print(df.columns)

new_rows = []

for idx, row in df.iterrows():
    text = row['Cleaned_for_summary']

    summary1 = row['GPT4o_summary_1']
    summary2 = row['GPT4o_summary_2']

    if pd.notna(summary1) and summary1.strip():
        new_rows.append({"text": text, "summary": summary1})

    if pd.notna(summary2) and summary2.strip():
        new_rows.append({"text": text, "summary": summary2})

df_final = pd.DataFrame(new_rows)

df_final.to_csv("/content/straipsniai_final_versija.csv", index=False, encoding='utf-8-sig')

print(f"Total number of text-summary pairs: {len(df_final)}. File saved as straipsniai_final_versija.csv")


### The dataset for mBART fine-tuning is split into train, validation and test sets (80/10/10)

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
df = pd.read_csv("/content/straipsniai_final_versija.csv")

train_df, temp_df = train_test_split(df, test_size=0.2, random_state=2025)

val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=2025)

print(f"Train: {len(train_df)} articles")
print(f"Validation: {len(val_df)} articles")
print(f"Test: {len(test_df)} articles")

train_df.to_csv("/content/train.csv", index=False, encoding="utf-8-sig")
val_df.to_csv("/content/val.csv", index=False, encoding="utf-8-sig")
test_df.to_csv("/content/test.csv", index=False, encoding="utf-8-sig")

print("Files successfully saved")
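
Since every article appears twice in the final dataset (once per summary), a row-level split can place the same article text in both the training and the test set. If that matters for the evaluation, one option is to split by article instead of by row. The following is only a sketch of that alternative in plain Python, not the method used above; `split_by_article` and the toy rows are hypothetical.

```python
import random


def split_by_article(rows, train_frac=0.8, val_frac=0.1, seed=2025):
    """Split rows so that all rows sharing a text land in the same subset."""
    texts = sorted({row["text"] for row in rows})
    random.Random(seed).shuffle(texts)

    n_train = int(len(texts) * train_frac)
    n_val = int(len(texts) * val_frac)
    train_texts = set(texts[:n_train])
    val_texts = set(texts[n_train:n_train + n_val])

    train, val, test = [], [], []
    for row in rows:
        if row["text"] in train_texts:
            train.append(row)
        elif row["text"] in val_texts:
            val.append(row)
        else:
            test.append(row)
    return train, val, test


# Toy dataset: 10 articles, each paired with two summaries
rows = [
    {"text": f"article {i}", "summary": f"summary {i}-{k}"}
    for i in range(10)
    for k in (1, 2)
]
train, val, test = split_by_article(rows)
print(len(train), len(val), len(test))
```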