# Sentiment analysis in newspapers
The program analyses the sentiment of newspaper articles using the NLTK library.
## Current version
The program uses BeautifulSoup to extract the headline and the article body from a Fox News webpage. It then analyses the text with VADER, the sentiment analysis tool incorporated into NLTK. The program tokenises the text, lemmatises it, and removes non-alphabetic tokens. Then it removes stopwords (e.g. articles) from the list of tokens. The text processing ends with joining the list of tokens into a single string and analysing it with `SentimentIntensityAnalyzer`. In the main version, the program saves the results in new files; in this assignment, it prints the results in the terminal. In the next step, the program splits the article into sentences and analyses each sentence separately. Finally, it summarises the sentiment scores of the sentences.
## Data
I tested the program on two articles from Fox News: _Biden admin's plan for mass release of migrants into US outlined in internal 2022 memo_ and _Florida National Guard troops 'proud to help' fight Texas border crisis, DeSantis says_. The first text seemed negative towards President **Biden's administration**, while the second was positive towards **Republican governors**. As such, the expected result of the sentiment analysis was known in advance, so a malfunction would be easy to detect.
## Sentiment analysis theory
### Sentiment classification
Sentiment can be classified at three levels (D'Andrea et al., 2015):
* document-level: the whole document is treated as the basic information unit;
* sentence-level: each sentence is treated as an information unit;
* aspect-level: each aspect of an entity is treated as an information unit.
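
The difference between the first two levels can be illustrated with a minimal sketch. It does not use NLTK or VADER: `TOY_LEXICON`, `toy_score`, `document_level` and `sentence_level` are hypothetical names, and the lexicon values are made up for the illustration. Aspect-level analysis is not covered here.

```python
import re

# Hypothetical toy lexicon, for illustration only (these are not VADER's values).
TOY_LEXICON = {"good": 1.0, "proud": 1.0, "bad": -1.0, "crisis": -1.0}


def toy_score(text):
    """Sum the toy valence of every known word in `text`."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(TOY_LEXICON.get(word, 0.0) for word in words)


def document_level(text):
    """Document level: the whole text is one information unit."""
    return toy_score(text)


def sentence_level(text):
    """Sentence level: every sentence is scored on its own."""
    # Naive split after sentence-final punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [(sentence, toy_score(sentence)) for sentence in sentences]


sample = "The plan is good. The crisis is bad. We are proud."
print("Document:", document_level(sample))
for sentence, score in sentence_level(sample):
    print(score, sentence)
```

In this sample the positive and negative words cancel out at the document level, while the sentence-level view still shows one clearly negative sentence between two positive ones.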

### NLTK
The Natural Language Toolkit (NLTK) is a library in Python for NLP tasks. It provides numerous tools for processing and analyzing text.

#### Why NLTK?
* The NLTK library provides built-in modules for tokenization, lemmatization and sentiment analysis.
* It provides pre-trained/evaluated models and datasets.
* Documentation, support and customization guides are easy to find online.

### VADER (Valence Aware Dictionary and sEntiment Reasoner)
VADER is a lexicon and rule-based tool designed for sentiment analysis (SA) of social media texts. Its lexicon contains words and their associated sentiment scores. The algorithm is best suited to short texts such as social media posts, because it takes into account elements like emoticons and capitalization. Nevertheless, in this project the program assigned the expected sentiment to the full text of both newspaper articles.
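
As far as I understand the VADER implementation, the `compound` score used in this notebook is the sum of the adjusted word valences squashed into the range [-1, 1] with a constant `alpha = 15`. The sketch below reproduces only that final step together with the ±0.05 thresholds used in the code; `normalize` and `label` are written here for illustration, and the raw sums passed in are hypothetical inputs rather than values read from the lexicon.

```python
import math


def normalize(raw_sum, alpha=15):
    """Squash an unbounded valence sum into [-1, 1] (assumed VADER formula)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


def label(compound, threshold=0.05):
    """Map a compound score to a label using the thresholds from the notebook."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Hypothetical raw valence sums, chosen only to show the shape of the mapping.
for raw_sum in (-3.0, 0.0, 0.1, 5.2):
    compound = normalize(raw_sum)
    print(raw_sum, round(compound, 4), label(compound))
```

Because of the square root in the denominator, large sums saturate near ±1, so a long text with many sentiment-bearing words tends to get an extreme compound score.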



In [6]:
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize

nltk.download('stopwords')
nltk.download('punkt') # tokeniser models used by word_tokenize and sent_tokenize
nltk.download('vader_lexicon')
nltk.download('wordnet')
nltk.download('omw-1.4')

def processing(tekst):
    tokens = word_tokenize(tekst.lower())
    lemmatyzacja = WordNetLemmatizer()
    tokens_lemma = [lemmatyzacja.lemmatize(token) for token in tokens]
    tokens_without = [token for token in tokens_lemma if token.isalpha()]
    stop_words = set(stopwords.words('english')) # the name must not be `stopwords`, otherwise it would shadow the imported stopwords module
    tokens_stop = [token for token in tokens_without if token not in stop_words]
    # Return order matches the unpacking used by the callers
    return tokens, tokens_without, tokens_stop, tokens_lemma

def sentiment_analysis(tekst):
    sentiment_tool = SentimentIntensityAnalyzer()
    score = sentiment_tool.polarity_scores(tekst)
    sentiment_score = score['compound']
    if sentiment_score >= 0.05:
        sentiment = 'Positive sentiment'
    elif sentiment_score <= -0.05:
        sentiment = 'Negative sentiment'
    else:
        sentiment = 'Neutral sentiment'
    return sentiment, sentiment_score, score

def sentence_sentiment(article_body):
    sentiment_tool = SentimentIntensityAnalyzer()
    positive_score = 0
    negative_score = 0
    neutral_score = 0
    total_sentiment_score = 0
    sentences = sent_tokenize(article_body)
    sentiment_results = []
    for sentence in sentences:
        scores = sentiment_tool.polarity_scores(sentence)
        sentiment_score = scores['compound']
        total_sentiment_score += sentiment_score
        if sentiment_score >= 0.05:
            sentiment = 'Positive'
            positive_score += 1
        elif sentiment_score <= -0.05:
            sentiment = 'Negative'
            negative_score += 1
        else:
            sentiment = 'Neutral'
            neutral_score += 1
        sentiment_results.append((sentence, sentiment, sentiment_score))
    if positive_score > negative_score:
        overall_sentiment = 'Positive'
    elif negative_score > positive_score:
        overall_sentiment = 'Negative'
    else:
        overall_sentiment = 'Neutral'
    # Guard against empty input so the average does not divide by zero
    average_sentiment_score = total_sentiment_score / len(sentences) if sentences else 0.0
    return overall_sentiment,total_sentiment_score,average_sentiment_score,sentiment_results

def get_data(page_url):   
    # Send a GET request to the webpage
    response = requests.get(page_url)

    # Create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(response.content, "html.parser")

    # Get the headline of the article
    headline = soup.find("h1").text.strip()

    # Extract the article body
    article_body = soup.find("div", class_="article-body").text.strip()
    return article_body,headline

page_url = 'https://www.foxnews.com/politics/biden-admins-plan-mass-release-migrants-us-outlined-internal-2022-memo'

article_body,headline = get_data(page_url)
tokens, tokens_without, tokens_stop, tokens_lemma = processing(article_body)
output_text = ' '.join(tokens_stop)

sentiment, sentiment_score, score = sentiment_analysis(output_text)
sentyment = str(sentiment)
wynik = str(sentiment_score)
print('Sentiment analysis:')
print('Sentiment:' + sentyment +'\n')
print('Sentiment score:' + wynik +'\n')

# Sentence analysis
overall_sentiment,total_sentiment_score,average_sentiment_score,sentiment_results = sentence_sentiment(article_body)
print('Sentence analysis:')
print("Overall:" + str(overall_sentiment) +'\n')
print("Total:" + str(total_sentiment_score) +'\n')
print("Average:" + str(average_sentiment_score) +'\n')
for result in sentiment_results:
    print(f"Sentence: {result[0]}\n")
    print(f"Sentiment: {result[1]}\n")
    print(f"Sentiment Score: {result[2]}\n")
    print('\n')

# Headline analysis
overall_sentiment,total_sentiment_score,average_sentiment_score,sentiment_results = sentence_sentiment(headline)
print('Headline analysis by sentence:')
print("Overall:" + str(overall_sentiment) +'\n')
print("Total:" + str(total_sentiment_score) +'\n')

tokens, tokens_without, tokens_stop, tokens_lemma = processing(headline)
output_text = ' '.join(tokens_stop)
sentiment, sentiment_score, score = sentiment_analysis(output_text)
sentyment = str(sentiment)
wynik = str(sentiment_score)
print('Headline analysis:')
print('Sentiment:' + sentyment +'\n')
print('Sentiment score:' + wynik +'\n')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Stani\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Stani\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Stani\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Stani\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Stani\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Sentiment analysis:
Sentiment:Negative sentiment

Sentiment score:-0.6303

Sentence analysis:
Overall:Negative

Total:-2.4141

Average:-0.08621785714285714

Sentence: close      Video Illegal border crossers line up ahead of expiration of Title 42 Radio host Sergio Sanchez shares how Texas locals are reacting to the expiration of Title 42.FIRST ON FOX: A decision to authorize all Border Patrol Sectors to begin "safe" mass releases of migrants to city streets if non-governmental organizations (NGOs) are overcapacity will take place in line with a 2022 memo that was uncovered during legal proceedings initiated by Florida Attorney General Ashley Moody last year and outlined how to handle releases when Title 42 ends.

Sentiment: Negative

Sentiment Score: -0.5267



Sentence: Fox News on Tuesday reported that top border officials in Washington, D.C., have decided to authorize all Border Patrol Sectors to begin the releases if Customs and Border Protection (CBP) and NGOs can’t hold migrants

In [7]:
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize

nltk.download('stopwords')
nltk.download('punkt') # tokeniser models used by word_tokenize and sent_tokenize
nltk.download('vader_lexicon')
nltk.download('wordnet')

def processing(tekst):
    tokens = word_tokenize(tekst.lower())
    lemmatyzacja = WordNetLemmatizer()
    tokens_lemma = [lemmatyzacja.lemmatize(token) for token in tokens]
    tokens_without = [token for token in tokens_lemma if token.isalpha()]
    stop_words = set(stopwords.words('english')) # the name must not be `stopwords`, otherwise it would shadow the imported stopwords module
    tokens_stop = [token for token in tokens_without if token not in stop_words]
    # Return order matches the unpacking used by the callers
    return tokens, tokens_without, tokens_stop, tokens_lemma

def sentiment_analysis(tekst):
    sentiment_tool = SentimentIntensityAnalyzer()
    score = sentiment_tool.polarity_scores(tekst)
    sentiment_score = score['compound']
    if sentiment_score >= 0.05:
        sentiment = 'Positive sentiment'
    elif sentiment_score <= -0.05:
        sentiment = 'Negative sentiment'
    else:
        sentiment = 'Neutral sentiment'
    return sentiment, sentiment_score, score

def sentence_sentiment(article_body):
    sentiment_tool = SentimentIntensityAnalyzer()
    positive_score = 0
    negative_score = 0
    neutral_score = 0
    total_sentiment_score = 0
    sentences = sent_tokenize(article_body)
    sentiment_results = []
    for sentence in sentences:
        scores = sentiment_tool.polarity_scores(sentence)
        sentiment_score = scores['compound']
        total_sentiment_score += sentiment_score
        if sentiment_score >= 0.05:
            sentiment = 'Positive'
            positive_score += 1
        elif sentiment_score <= -0.05:
            sentiment = 'Negative'
            negative_score += 1
        else:
            sentiment = 'Neutral'
            neutral_score += 1
        sentiment_results.append((sentence, sentiment, sentiment_score))
    if positive_score > negative_score:
        overall_sentiment = 'Positive'
    elif negative_score > positive_score:
        overall_sentiment = 'Negative'
    else:
        overall_sentiment = 'Neutral'
    # Guard against empty input so the average does not divide by zero
    average_sentiment_score = total_sentiment_score / len(sentences) if sentences else 0.0
    return overall_sentiment,total_sentiment_score,average_sentiment_score,sentiment_results

def get_data(page_url):   
    # Send a GET request to the webpage
    response = requests.get(page_url)

    # Create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(response.content, "html.parser")

    # Get the headline of the article
    headline = soup.find("h1").text.strip()

    # Extract the article body
    article_body = soup.find("div", class_="article-body").text.strip()
    return article_body,headline

page_url = 'https://www.foxnews.com/politics/florida-national-guard-troops-proud-help-fight-texas-border-crisis-desantis-says'

article_body,headline = get_data(page_url)
tokens, tokens_without, tokens_stop, tokens_lemma = processing(article_body)
output_text = ' '.join(tokens_stop)

sentiment, sentiment_score, score = sentiment_analysis(output_text)
sentyment = str(sentiment)
wynik = str(sentiment_score)
print('Sentiment analysis:')
print('Sentiment:' + sentyment +'\n')
print('Sentiment score:' + wynik +'\n')

# Sentence analysis
overall_sentiment,total_sentiment_score,average_sentiment_score,sentiment_results = sentence_sentiment(article_body)
print('Sentence analysis:')
print("Overall:" + str(overall_sentiment) +'\n')
print("Total:" + str(total_sentiment_score) +'\n')
print("Average:" + str(average_sentiment_score) +'\n')
for result in sentiment_results:
    print(f"Sentence: {result[0]}\n")
    print(f"Sentiment: {result[1]}\n")
    print(f"Sentiment Score: {result[2]}\n")
    print('\n')

# Headline analysis
overall_sentiment,total_sentiment_score,average_sentiment_score,sentiment_results = sentence_sentiment(headline)
print('Headline analysis by sentence:')
print("Overall:" + str(overall_sentiment) +'\n')
print("Total:" + str(total_sentiment_score) +'\n')

tokens, tokens_without, tokens_stop, tokens_lemma = processing(headline)
output_text = ' '.join(tokens_stop)
sentiment, sentiment_score, score = sentiment_analysis(output_text)
sentyment = str(sentiment)
wynik = str(sentiment_score)
print('Headline analysis:')
print('Sentiment:' + sentyment +'\n')
print('Sentiment score:' + wynik +'\n')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Stani\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Stani\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Stani\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Stani\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Sentiment analysis:
Sentiment:Positive sentiment

Sentiment score:0.6486

Sentence analysis:
Overall:Positive

Total:1.0437

Average:0.029820000000000003

Sentence: close      Video Terror arrests at southern border rising at alarming rate Rep. Tony Gonzales, R-Texas, addresses the Homeland Security threat posed by the migrant crisis.Gov.

Sentiment: Negative

Sentiment Score: -0.8316



Sentence: Ron DeSantis said Florida is "proud to help" secure the southern border days after Texas Gov.

Sentiment: Positive

Sentiment Score: 0.802



Sentence: Greg Abbott announced the arrival of National Guard troops from the Sunshine State.

Sentiment: Positive

Sentiment Score: 0.4939



Sentence: "We are proud to help Texas fight Biden’s Border Crisis," DeSantis tweeted late Sunday in response to a May 23 tweet by Abbott.

Sentiment: Positive

Sentiment Score: 0.4939



Sentence: "Florida National Guard service members arrived in Texas over the weekend.

Sentiment: Neutral

Sentiment Score: 0.0


# The results
## First article
The first article focused on President Biden's plans for migration in the USA. The predicted result for this article was a negative sentiment score. The program calculated the score as expected: in both text and sentence analysis, the resulting score was negative. Surprisingly, the headline was classified as neutral by both methods.

**Sentiment analysis**:

* **Sentiment**: Negative sentiment

* **Sentiment score**: -0.6303

* **Sentence analysis**:

    * **Overall**: Negative

    * **Total**: -2.4141

    * **Average**: -0.08621785714285714

**Headline analysis by sentence**:

* **Overall**: Neutral

* **Total**: 0.0

**Headline analysis**:

* **Sentiment**: Neutral sentiment

* **Sentiment score**: 0.0

## Second article
The second article covered how Republican governors responded to the migrant crisis in the United States. The predicted result was a positive sentiment score, and the program was correct again: in both text and sentence analysis, the resulting sentiment score was positive. However, the sentiment for the headline was negative, probably because words such as "fight" and "crisis" carry negative scores in the VADER lexicon and outweigh "proud" and "help". This suggests problems in detecting sentiment for more subtle, ambiguous sentences.

**Sentiment analysis**:

* **Sentiment**: Positive sentiment

* **Sentiment score**: 0.6486

* **Sentence analysis**:

    * **Overall**: Positive

    * **Total**: 1.0437

    * **Average**: 0.029820000000000003

**Headline analysis by sentence**:

* **Overall**: Negative

* **Total**: -0.5574

**Headline analysis**:

* **Sentiment**: Negative sentiment

* **Sentiment score**: -0.6124
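
The overall label in `sentence_sentiment` comes from a majority vote over the sentence labels, not from the average score. As a rough cross-check, the sketch below applies the same ±0.05 thresholds to the averages reported above. `label_from_average` is a hypothetical helper that is not part of the notebook; the numbers are copied from the results.

```python
def label_from_average(average, threshold=0.05):
    """Label an average compound score with the notebook's thresholds."""
    if average >= threshold:
        return "Positive"
    if average <= -threshold:
        return "Negative"
    return "Neutral"


# (average sentence score, majority-vote label printed by the program)
reported = {
    "first article": (-0.08621785714285714, "Negative"),
    "second article": (0.029820000000000003, "Positive"),
}

for name, (average, majority_label) in reported.items():
    print(f"{name}: majority vote = {majority_label}, "
          f"average-based = {label_from_average(average)}")
```

For the second article the average lies inside the neutral band, so the two aggregation rules would give different labels. This may be worth keeping in mind when reading the "Overall" value.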

# Conclusion
The program detected the expected sentiment of both articles using both methods (analysing the processed text and analysing the individual sentences). However, it detected negative sentiment in the headline of the second article, which suggests that the tools provided by the NLTK library can be imprecise. Texts with ambiguous or mixed sentiment are challenging to analyse. What is more, SA tools might have problems with sarcasm or irony.


Overall, NLTK provides convenient and efficient tools for SA (e.g. `SentimentIntensityAnalyzer` with sentiment and compound scores). Preprocessing (tokenisation, lemmatisation, and the removal of special characters and stopwords) can improve the accuracy of SA. However, it is essential to remember the limitations.

# References
* https://www.foxnews.com/politics/biden-admins-plan-mass-release-migrants-us-outlined-internal-2022-memo
* https://www.foxnews.com/politics/florida-national-guard-troops-proud-help-fight-texas-border-crisis-desantis-says
* https://www.nltk.org/api/nltk.html
* D'Andrea, A., Ferri, F., Grifoni, P., & Guzzo, T. (2015). Approaches, tools and applications for sentiment analysis implementation. International Journal of Computer Applications, 125(3), 26–33. https://doi.org/10.5120/ijca2015905866



