# Project for Computational Linguistics

# 1. Introduction
For this project, I wanted to check whether the news titles about the Polish tennis player Iga Świątek on the sport.pl website are positive, negative, or neutral. Iga's supporters believe that these titles are mostly negative in order to stir up emotions and attract clicks. They consider such tactics unnecessary: Iga is ranked second in the world (and was previously number one for a long time), so she is well known enough to draw readers without them. They also think that sports portals such as sport.pl should report on their subject reliably rather than like tabloids.

# 2. Data
The data for this project come from the first five pages about Iga Świątek on sport.pl (https://www.sport.pl/iga-swiatek, https://www.sport.pl/iga-swiatek/2, https://www.sport.pl/iga-swiatek/3, https://www.sport.pl/iga-swiatek/4, https://www.sport.pl/iga-swiatek/5). I chose five pages because together they yield 200 recent titles, which I considered a suitable research sample. I chose this site because Iga's fans regard it as clickbait, while sport.pl is at the same time one of the biggest sports sites in Poland (https://media-panel.pl/pl/aktualnosci/zestawienia-tematyczne-i-funkcjonalne-grudzien-2024/).

# 3. Code
## 3.1. Imports and configuration
The code starts by importing all the libraries and functions that are used later in the project.


In [8]:
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from collections import Counter
from tabulate import tabulate
import random

## 3.2. NLTK resources
I then downloaded the NLTK punkt resource, which provides the tokeniser used to split text into words.

In [None]:
nltk.download("punkt")

## 3.3. Function: load_custom_stopwords
I used this function to load my own list of stopwords (words ignored in the analysis; in Polish, for example, "i", "ale", "lub"). The default stopword list was in English and the code did not work with it, so I prepared a Polish stopword list (polish.stopwords.txt). Removing stopwords makes the analysis of the titles more specific. If the file does not exist, the function prints a message and returns an empty set.


In [10]:
def load_custom_stopwords(file_path):
    try:
        with open(file_path, "r", encoding="utf-8") as file:
            custom_stopwords = {line.strip().lower() for line in file if line.strip()}
            return custom_stopwords
    except FileNotFoundError:
        print(f"File {file_path} not found. Please make sure the file exists.")
        return set()

# Load the Polish stopword list once, at module level.
STOPWORDS = load_custom_stopwords("polish.stopwords.txt")
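
The behaviour described above (lowercasing, skipping blank lines, and the empty-set fallback) can be checked with a minimal sketch. The function body is repeated under the name `load_stopwords` so the example runs on its own, and the file `sample.stopwords.txt` and its contents are hypothetical, not the project's real list.

```python
import os
import tempfile


def load_stopwords(file_path):
    """Same logic as load_custom_stopwords, repeated so the sketch runs alone."""
    try:
        with open(file_path, "r", encoding="utf-8") as file:
            return {line.strip().lower() for line in file if line.strip()}
    except FileNotFoundError:
        return set()


# Hypothetical sample file: one stopword per line, with mixed case,
# a blank line, and surrounding whitespace.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.stopwords.txt")
    with open(sample_path, "w", encoding="utf-8") as sample:
        sample.write("i\nAle\n\n  lub  \n")

    loaded = load_stopwords(sample_path)
    missing = load_stopwords(os.path.join(tmp_dir, "does-not-exist.txt"))

print(sorted(loaded))
print(missing)
```

Because the result is a set, the membership test `token not in STOPWORDS` used later stays fast regardless of how long the list is.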

## 3.4. Function: get_titles_from_website
This function retrieves a web page with requests.get and checks that the server responded with HTTP status 200; otherwise it raises an exception. BeautifulSoup then parses the HTML, and the function selects all elements matching the CSS selector (h2 by default), extracts their text as titles, and returns the list of titles.

In [11]:
def get_titles_from_website(url, css_selector="h2"):
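    # A timeout keeps the request from hanging if the server does not respond.
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        raise Exception(f"Failed to connect to the URL: {url} (HTTP status {response.status_code})")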
    
    soup = BeautifulSoup(response.text, 'html.parser')
    titles = [element.text.strip() for element in soup.select(css_selector)]
    return titles
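
The h2-extraction step can be tried on an inline HTML string, without network access. The sketch below uses the standard library's `html.parser` as a stand-in for BeautifulSoup, so it is not the code used in the project. `H2TitleParser` and `SAMPLE_HTML` are hypothetical names, and the markup is invented; the real structure of sport.pl may differ.

```python
from html.parser import HTMLParser


class H2TitleParser(HTMLParser):
    """Collects the stripped text of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._inside_h2 = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._inside_h2 = True
            self._parts = []

    def handle_data(self, data):
        # Text of nested tags (e.g. <a>, <b>) is collected as well.
        if self._inside_h2:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._inside_h2:
            self.titles.append("".join(self._parts).strip())
            self._inside_h2 = False


# Hypothetical markup, invented for illustration.
SAMPLE_HTML = """
<html><body>
  <h1>Iga Świątek</h1>
  <article><h2> <a href="/a">First sample title</a> </h2></article>
  <article><h2>Second <b>sample</b> title</h2></article>
  <p>Not a title</p>
</body></html>
"""

parser = H2TitleParser()
parser.feed(SAMPLE_HTML)
parser.close()
sample_titles = parser.titles
print(sample_titles)
```

As with `element.text.strip()` in the function above, the text of nested tags is included and the surrounding whitespace is removed.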

## 3.5. Function: preprocess_titles(titles, sentiment)
This function prepares the input data for the sentiment classifier. Each title is converted to lowercase (title.lower()) and split into individual words with word_tokenize. Tokens that are not alphanumeric (e.g. punctuation marks) and tokens that are on the stopword list are removed. dict(Counter(tokens)) then builds a feature dictionary that holds the number of occurrences of each word in the title. Finally, the feature dictionary and the assigned sentiment are appended as a pair to the processed list.

In [12]:
def preprocess_titles(titles, sentiment):
    processed = []
    for title in titles:
        tokens = word_tokenize(title.lower())
        tokens = [token for token in tokens if token.isalnum() and token not in STOPWORDS]
        processed.append((dict(Counter(tokens)), sentiment))
    return processed
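
To show what such a feature dictionary looks like, here is a minimal sketch applied to one of the downloaded titles. It assumes a simple regex tokenizer (`simple_tokenize`) in place of `word_tokenize`, so it runs without NLTK, and a hypothetical mini stopword list (`SAMPLE_STOPWORDS`) instead of polish.stopwords.txt. The real tokenizer and list may give slightly different tokens.

```python
import re
from collections import Counter

# Hypothetical mini stopword list; the project loads its list from a file.
SAMPLE_STOPWORDS = {"o", "od", "tym", "co"}


def simple_tokenize(text):
    """Regex stand-in for word_tokenize: word runs or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def title_to_features(title, stopwords):
    """Lowercase, tokenize, drop punctuation and stopwords, then count the rest."""
    tokens = simple_tokenize(title.lower())
    tokens = [t for t in tokens if t.isalnum() and t not in stopwords]
    return dict(Counter(tokens))


title = 'Słowacy od rana piszą o tym, co zrobiła Świątek. "Okrutny koniec"'
features = title_to_features(title, SAMPLE_STOPWORDS)
print(features)
```

Each remaining word maps to its count in the title. Since titles are short, most counts are 1.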

## 3.6. Function: classify_titles(classifier, titles)
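This function classifies the sentiment of each article title using a trained Naive Bayes classifier. It applies the same tokenisation and filtering as preprocess_titles and returns a list of (title, sentiment) pairs.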

In [13]:
def classify_titles(classifier, titles):
    classified = []
    for title in titles:
        tokens = word_tokenize(title.lower())
        tokens = [token for token in tokens if token.isalnum() and token not in STOPWORDS]
        features = dict(Counter(tokens))
        sentiment = classifier.classify(features)
        classified.append((title, sentiment))
    return classified
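
To illustrate the idea behind the classification step, the sketch below implements a simplified multinomial Naive Bayes with add-one smoothing using only the standard library. It is not NLTK's NaiveBayesClassifier, whose implementation differs in its details. The names `train_naive_bayes` and `classify` and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_features):
    """Count labels and per-label word occurrences from (features, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in labeled_features:
        label_counts[label] += 1
        for word, count in features.items():
            word_counts[label][word] += count
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(model, features):
    """Return the label with the highest log prior plus smoothed log likelihood."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):  # sorted for deterministic tie-breaking
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word, count in features.items():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += count * math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data, not taken from sport.pl.
toy_training = [
    ({"wygrana": 1, "świetny": 1}, "positive"),
    ({"triumf": 1, "wygrana": 1}, "positive"),
    ({"porażka": 1, "fatalny": 1}, "negative"),
    ({"porażka": 1, "koniec": 1}, "negative"),
]
toy_model = train_naive_bayes(toy_training)
prediction = classify(toy_model, {"porażka": 1, "mecz": 1})
print(prediction)
```

The classifier can only associate a word with a label if that word appeared under that label in the training data, so the quality of the labels determines the quality of the predictions.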

## 3.7. Function: main (downloading titles, assigning labels, and displaying results)
First, the titles are downloaded from the five sport.pl pages about Iga Świątek. Next, each title is assigned one of the sample labels ("positive", "negative", "neutral") at random with random.choice (seeded with 42), and the labelled data are split into a training set (80%) and a test set (20%). The Naive Bayes classifier is trained on the training set, and its accuracy is measured on the test set. All titles are then classified with the trained model and displayed in a table. Finally, the number of titles assigned to each category is counted and printed.

In [16]:
def main():
    urls = [
        "https://www.sport.pl/iga-swiatek",
        "https://www.sport.pl/iga-swiatek/2",
        "https://www.sport.pl/iga-swiatek/3",
        "https://www.sport.pl/iga-swiatek/4",
        "https://www.sport.pl/iga-swiatek/5"
    ]
    
    all_titles = []
    
    print("Downloading article titles from given sites...")
    try:
        for url in urls:
            titles = get_titles_from_website(url)
            all_titles.extend(titles)
    except Exception as e:
        print(f"Error: {e}")
        return
    
    if not all_titles:
        print("No titles found.")
        return
    
    print(f"Downloaded {len(all_titles)} titles from {len(urls)} pages.")

    random.seed(42)
    labels = ["positive", "negative", "neutral"]
    labeled_data = [
        (title, random.choice(labels)) for title in all_titles
    ]

    train_data, test_data = train_test_split(labeled_data, test_size=0.2, random_state=42)

    train_set = []
    for title, sentiment in train_data:
        train_set.extend(preprocess_titles([title], sentiment))
    
    test_set = []
    for title, sentiment in test_data:
        test_set.extend(preprocess_titles([title], sentiment))

    classifier = NaiveBayesClassifier.train(train_set)

    print("\nEvaluation of the model on test data:")
    print(f"Accuracy: {accuracy(classifier, test_set) * 100:.2f}%")

    classified_titles = classify_titles(classifier, all_titles)

    print("\nNaive Bayes classification results:")
    table = [[i + 1, title, sentiment] for i, (title, sentiment) in enumerate(classified_titles)]
    headers = ["#", "Title", "Sentiment"]
    print(tabulate(table, headers, tablefmt="grid"))
    
    sentiment_counts = Counter([sentiment for _, sentiment in classified_titles])
    print("\nSummary of classification:")
    for sentiment, count in sentiment_counts.items():
        print(f"{sentiment.capitalize()}: {count}")

if __name__ == "__main__":
    main()

Downloading article titles from given sites...
Downloaded 200 titles from 5 pages.

Evaluation of the model on test data:
Accuracy: 47.50%

Naive Bayes classification results:
+-----+-------------------------------------------------------------------------------------------------+-------------+
|   # | Title                                                                                           | Sentiment   |
|   1 | GIF przerwał milczenie ws. leku Świątek. "Niezapowiedziana inspekcja"                           | neutral     |
+-----+-------------------------------------------------------------------------------------------------+-------------+
|   2 | Słowacy od rana piszą o tym, co zrobiła Świątek. "Okrutny koniec"                               | positive    |
+-----+-------------------------------------------------------------------------------------------------+-------------+
|   3 | Wtedy Świątek wyprzedzi Sabalenkę w rankingu. Oto scenariusze                                   

# 4. Results
The results did not confirm the hypothesis that the titles are mostly negative: most titles of the sport.pl articles about Iga Świątek were classified as neutral (74) or positive (68) rather than negative (58). The exact numbers may change, because several new articles about Iga appear throughout the day, but the difference is large enough that the hypothesis would still be refuted.
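
The counts above can be converted into shares of the sample with a short calculation. The sketch also relates the 47.50% accuracy printed in section 3.7 to a number of test titles, assuming that test_size=0.2 applied to the 200 downloaded titles gives 40 test titles.

```python
# Counts reported in the results above.
counts = {"neutral": 74, "positive": 68, "negative": 58}
total = sum(counts.values())

# Share of each category in percent.
shares = {label: 100 * count / total for label, count in counts.items()}
for label, share in shares.items():
    print(f"{label.capitalize()}: {counts[label]} of {total} ({share:.1f}%)")

# Assumption: a 20% test split of the 200 titles contains 40 titles.
test_size = total * 20 // 100
correct = round(47.5 * test_size / 100)
print(f"Test titles: {test_size}, correctly classified: {correct}")
```

Under these assumptions, the shares are 37% neutral, 34% positive, and 29% negative, and the reported accuracy corresponds to 19 of 40 test titles.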
