##### Sentiment Analysis of English Newspaper Articles 

This notebook analyzes the sentiment of English-language newspaper articles about the COVID-19 pandemic, published around 2020 by Hankyoreh, Joongang, ABC News, and Fox News.

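The basic steps of the code are:
 - Step 1. Import libraries and read in Excel files
 - Step 2. Search for sentences with the target word
 - Step 3. Run sentiment analysis using NewsMTSC
 - Step 4. Filter the results and export them to an Excel file
 - Step 5. Run a chi-squared test on the results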

This code was written by Hyebin Seo.

##### Step 1. Import Libraries & Read in Excel Files

In [17]:
import pandas as pd

import nltk

nltk.download("punkt")
from nltk.tokenize import sent_tokenize

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\seohy\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [18]:
def read_df(path):
    """
    Loads an Excel file from the specified path into a pandas DataFrame.

    :param path: String representing the file path to the Excel file.
    :return: A pandas DataFrame containing the data from the Excel file.
    """

    df = pd.read_excel(path)
    # fmt: off
    df = df.rename(columns={
        "Country": "country",
        "신문사": "newspaper", 
        "제목": "title", 
        "본문": "body", 
        "URL": "url"})
    # fmt: on
    return df

In [19]:
paths = input("Please input path to excel file(s) separated with commas: ")

In [20]:
# read in Excel data at path(s)

# split on commas and strip surrounding whitespace so "a.xlsx, b.xlsx" also works
path_list = [path.strip() for path in paths.split(",") if path.strip()]
raw_df = pd.concat([read_df(path) for path in path_list])


##### Step 2. Search for Sentences with Target Word

In [21]:
target_word = input("Enter a word to find its sentiment: ")

In [22]:
def split_into_sentence(df, column="body"):
    """
    Splits the text in the specified column of a DataFrame into sentences.

    :param df: DataFrame with a column containing text to be split
    :param column: The name of the column in the DataFrame that contains the text to be split into sentences
    :return: A new DataFrame where each row corresponds to one sentence from the text in 'column'
    """

    # work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()
    df[column] = df[column].apply(sent_tokenize)

    # one row per sentence; the other columns are repeated for each sentence
    split_df = df.explode(column, ignore_index=True)

    return split_df
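
To make the reshaping step concrete, here is a minimal sketch of the `apply` + `explode` pattern on a toy DataFrame. The toy data and the `naive_sentence_split` helper are assumptions made for illustration only: the helper is a simple regex stand-in for `sent_tokenize` so that the example runs without downloading NLTK data, and it is less accurate than the real tokenizer.

```python
import re

import pandas as pd


def naive_sentence_split(text):
    """Hypothetical stand-in for nltk's sent_tokenize: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


toy_df = pd.DataFrame(
    {
        "title": ["Article A", "Article B"],
        "body": ["Masks are required. Fines may follow.", "Cases rose again."],
    }
)

# Each body becomes a list of sentences ...
toy_df["body"] = toy_df["body"].apply(naive_sentence_split)
# ... and explode() turns every list element into its own row,
# repeating the other columns (here: title) for each sentence.
toy_sentences = toy_df.explode("body", ignore_index=True)

print(toy_sentences)
```

The two-sentence article yields two rows that share the same title, which is why every sentence in the filtered table later in this step still carries its article's title and URL.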

In [23]:
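def sentences_with_target(df, column="body", word=target_word):
    """
    Returns a dataframe of the news sentences that contain the target word

    :param df: a dataframe containing sentences from articles
    :param column: the column to search for the target word
    :param word: the target word to search for (case-insensitive, matched literally)
    :return: a dataframe of the sentences that contain the target word
    """
    # use the `word` argument rather than the global, and match it literally
    mask = df[column].str.contains(word, na=False, case=False, regex=False)
    # copy so later column assignments do not raise SettingWithCopyWarning
    filtered_df = df[mask].copy()

    return filtered_df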
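
The filter above is a case-insensitive substring search that treats missing text as a non-match. The following sketch shows that behavior on made-up sentences; the toy data and the names `toy_sentences`, `toy_mask`, and `toy_filtered` are assumptions for illustration, and `regex=False` is added here so the word is matched literally.

```python
import pandas as pd

toy_sentences = pd.DataFrame(
    {
        "body": [
            "Wearing a mask is mandatory.",
            "MASKS sold out quickly.",
            "Cases rose again.",
            None,  # a missing body
        ]
    }
)

# case=False matches "mask", "Mask", "MASKS", ...
# na=False treats missing text as "no match" instead of propagating NaN
# regex=False makes this a literal substring search
toy_mask = toy_sentences["body"].str.contains("mask", case=False, na=False, regex=False)
toy_filtered = toy_sentences[toy_mask]

print(toy_mask.tolist())
print(toy_filtered)
```

Because this is a substring match, words such as "unmasked" would also pass the filter; whether that is desirable depends on the target word.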

In [24]:
filtered_df = sentences_with_target(split_into_sentence(raw_df))

In [25]:
# normalize "masks", "Masks", and "Mask" to "mask" so that the case-sensitive
# rpartition call in Step 3 can find the target word
filtered_df = filtered_df.copy()
filtered_df["body"] = filtered_df["body"].str.replace(r"[Mm]asks?", "mask", regex=True)

In [26]:
filtered_df.head(10)

Unnamed: 0,country,newspaper,title,body,url
0,Korea,Hankyoreh,Seoul’s subway app to allow users to report pa...,Seoul’s subway app will soon enable users to r...,https://www.hani.co.kr/arti/english_edition/e_...
1,Korea,Hankyoreh,Seoul’s subway app to allow users to report pa...,The Seoul Metropolitan Government announced th...,https://www.hani.co.kr/arti/english_edition/e_...
2,Korea,Hankyoreh,Seoul’s subway app to allow users to report pa...,"When such a report is made, security staff wil...",https://www.hani.co.kr/arti/english_edition/e_...
3,Korea,Hankyoreh,Seoul’s subway app to allow users to report pa...,Seoul plans to issue fines and take other ster...,https://www.hani.co.kr/arti/english_edition/e_...
4,Korea,Hankyoreh,Seoul’s subway app to allow users to report pa...,Even though it’s now mandatory to wear mask on...,https://www.hani.co.kr/arti/english_edition/e_...
6,Korea,Hankyoreh,Seoul’s subway app to allow users to report pa...,"From May 13, when Seoul implemented “daily soc...",https://www.hani.co.kr/arti/english_edition/e_...
7,Korea,Hankyoreh,Seoul’s subway app to allow users to report pa...,There were also five cases in which subway emp...,https://www.hani.co.kr/arti/english_edition/e_...
9,Korea,Hankyoreh,Seoul’s subway app to allow users to report pa...,"During the same period, city buses have seen a...",https://www.hani.co.kr/arti/english_edition/e_...
10,Korea,Hankyoreh,Seoul’s subway app to allow users to report pa...,“Many subway passengers who see someone not we...,https://www.hani.co.kr/arti/english_edition/e_...
19,Korea,Hankyoreh,[Reportage] Behind the scenes of S. Korea’s di...,"Behind their mask, their faces showed signs of...",https://www.hani.co.kr/arti/english_edition/e_...


##### Step 3. Run Sentiment Analysis using NewsMTSC

In [27]:
import tqdm as notebook_tqdm
from NewsSentiment import TargetSentimentClassifier

In [28]:
model = TargetSentimentClassifier()

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [29]:
from NewsSentiment.customexceptions import TooLongTextException, TargetNotFoundException

In [30]:
def sentiment_analysis(sentences, word=target_word):
    """
    Returns the most probable sentiment class for the target word in a sentence

    :param sentences: the sentence to run sentiment analysis on
    :param word: the target word to run a sentiment analysis on
    :return: a Series with the highest-probability class (class_id, class_label, class_prob),
             or a Series of three None values if the sentence could not be classified
    """
    # rpartition is case-sensitive and splits on the LAST occurrence of `word`,
    # so only the final mention in a sentence is treated as the target
    left_of_word, match, right_of_word = sentences.rpartition(word)

    try:
        result = model.infer_from_text(left_of_word, match, right_of_word)
        df = pd.DataFrame(result)
        classification_result = df.loc[
            df["class_prob"].idxmax()
        ]  # get the row with the highest probability
        return classification_result

    except TargetNotFoundException:
        print("Target word was not found in the text. Perhaps the target is in a different case? Perhaps you misspelled it?")
        return pd.Series([None] * 3)

    except TooLongTextException:
        print("Text is too long, split the text into smaller chunks")
        return pd.Series([None] * 3)

    except Exception:  # cause not yet identified
        # TODO: write a log entry(?) and record the reason for the exception
        # TODO: revisit this case
        # TODO: have the developer verify this, e.g. by checking the print output
        print("An error has occurred")
        return pd.Series([None] * 3)  # placeholder for now, fix later
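
The inline note on `rpartition` asks what happens when a sentence mentions the target more than once. The short sketch below, using made-up sentences, illustrates the two properties of `str.rpartition` that matter here: it splits on the last occurrence, and it is case-sensitive.

```python
sentence = "A mask mandate began in May, but many riders wore no mask."

# Split around the LAST occurrence of the target word.
left, match, right = sentence.rpartition("mask")
print(repr(left))   # everything before the final "mask"
print(repr(match))  # the target itself
print(repr(right))  # everything after it

# rpartition is case-sensitive: a capitalized target is not found,
# and the whole string ends up in the third element.
missing = "Mask rules were relaxed.".rpartition("mask")
print(missing)
```

In the first case the earlier "mask" stays inside the left context, so only the final mention is scored. In the second case the match is empty, which is why the text is normalized to lowercase "mask" in Step 2 before classification.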

In [31]:
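def all_result(df):
    """
    Returns a combined dataframe of the original data and the sentiment analysis results

    :param df: the dataframe of news sentences to classify
    :return: a combined dataframe of the original data and the sentiment analysis results
    """

    # successful rows fill class_id, class_label, class_prob;
    # failed rows fill the placeholder columns 0, 1, 2 with None
    result_df = df["body"].apply(sentiment_analysis)

    final_df = pd.concat([df, result_df], axis=1)
    # drop the placeholder columns; errors="ignore" covers runs without any failure
    final_df = final_df.drop(labels=[0, 1, 2], axis=1, errors="ignore")
    final_df = final_df.dropna()  # remove the sentences that could not be classified
    final_df = final_df.reset_index(drop=True)

    return final_df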
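
`all_result` relies on failed rows producing placeholder columns named 0, 1, and 2, which are then dropped. As a hedged alternative, and not the method used in this notebook, the results can be collected as one dictionary per sentence so that every row has the same three named columns. In the sketch below, `classify_stub`, `EMPTY`, and the toy data are hypothetical stand-ins; the real classifier is not called.

```python
import pandas as pd


def classify_stub(sentence):
    """Hypothetical stand-in for the real classifier; returns None when it 'fails'."""
    if "mask" not in sentence:
        return None
    return {"class_id": 1, "class_label": "neutral", "class_prob": 0.8}


# All-None record used for sentences that could not be classified.
EMPTY = {"class_id": None, "class_label": None, "class_prob": None}

toy_df = pd.DataFrame(
    {"body": ["Riders wore a mask.", "No target here.", "The mask rule ended."]}
)

# One record per sentence, so every row has the same three named columns.
records = [classify_stub(s) or EMPTY for s in toy_df["body"]]
toy_results = pd.DataFrame(records, index=toy_df.index)

# Attach the results, then discard the rows that could not be classified.
toy_final = pd.concat([toy_df, toy_results], axis=1).dropna().reset_index(drop=True)
print(toy_final)
```

With this layout there are no placeholder columns to remove, and `dropna` alone discards the unclassified sentence.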

In [32]:
raw_result = all_result(filtered_df)

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Text is too long, splitting into smaller chunks
An error has occured
An error has occured
An error has occured


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


In [33]:
raw_result.head(2)

Unnamed: 0,country,newspaper,title,body,url,class_id,class_label,class_prob
0,Korea,Hankyoreh,Seoul’s subway app to allow users to report pa...,Seoul’s subway app will soon enable users to r...,https://www.hani.co.kr/arti/english_edition/e_...,1.0,neutral,0.604794
1,Korea,Hankyoreh,Seoul’s subway app to allow users to report pa...,The Seoul Metropolitan Government announced th...,https://www.hani.co.kr/arti/english_edition/e_...,1.0,neutral,0.639253


##### Step 4. Filter Results of Sentiment Analysis and Export to an Excel File

In [34]:
def filter_result(df):
    """
    Returns a dataframe where the probability of sentiment analysis is higher than 0.7

    :param df: the combined df returned from function all_result
    :return: a dataframe where the probability of sentiment analysis is greater than 0.7
    """
    result_df = df[df["class_prob"] > 0.7]
    return result_df.reset_index(drop=True)  # reset index

In [35]:
# keep only the high-confidence results; they are exported to an Excel file in the next cell

final_results = filter_result(raw_result)

In [36]:
final_results.to_excel("results.xlsx")

##### Step 5. Run a Chi-Squared Test on Results

In [37]:
import scipy.stats as stats
import numpy as np

In [38]:
target_word

'mask'

In [39]:
def contingency_table(df, index_name="class_label", column_name="country"):
    """
    Returns a contingency table of the results of sentiment analysis

    :param df: a dataframe of the results of sentiment analysis
    :param index_name: the column of df to use as the rows (index) of the contingency table
    :param column_name: the column of df to use as the columns of the contingency table
    :return: a contingency table of the results of sentiment analysis
    """

    # count how often each (index_name, column_name) combination occurs
    table = pd.crosstab(df[index_name], df[column_name])

    return table

In [40]:
def chi_sqaure_test(contingency_table):
    """
    Prints the results of a chi-square test of independence

    :param contingency_table: a contingency table to run chi-square test on

    :return: None
    """
    chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
    # Pearson residuals: how far each observed count is from its expected count
    residuals = (contingency_table - expected) / np.sqrt(expected)

    print("Result of comparison:\n")

    print(f"Chi-squared Statistic: {chi2}")

    print(f"Degree of freedom: {dof}")

    print(f"Expected:{expected}\n")

    print(f"P-value: {p}")

    print(f"Residuals: {residuals}")

In [41]:
contingency_table(final_results)

class_label \ country,Korea,US
negative,28,95
neutral,68,83
positive,11,23


In [42]:
chi_sqaure_test(contingency_table(final_results))

Result of comparison:

Chi-squared Statistic: 14.922966447485889
Degree of freedom: 2
Expected:[[42.73051948 80.26948052]
 [52.45779221 98.54220779]
 [11.81168831 22.18831169]]

P-value: 0.0005748029774811731
Residuals: country         Korea        US
class_label                    
negative    -2.253455  1.644155
neutral      2.145891 -1.565675
positive    -0.236175  0.172317
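
The figures above can be checked by hand. The sketch below recomputes the expected counts, Pearson residuals, and chi-squared statistic from the observed counts in the contingency table, using only the standard library. It assumes the table is exactly as printed above; the closed-form p-value holds only because the test has 2 degrees of freedom, where the chi-squared survival function is exp(-x / 2).

```python
import math

# Observed counts copied from the contingency table above.
# Rows: negative, neutral, positive. Columns: Korea, US.
observed = [[28, 95], [68, 83], [11, 23]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson residual per cell; the chi-squared statistic is the sum of their squares.
residuals = [
    [(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
chi2 = sum(res ** 2 for row in residuals for res in row)

dof = (len(observed) - 1) * (len(observed[0]) - 1)
# Valid for dof == 2 only: the survival function reduces to exp(-x / 2).
p_value = math.exp(-chi2 / 2)

print(chi2, dof, p_value)
```

As a rough rule of thumb, cells whose residual exceeds about 2 in absolute value contribute most to the statistic. Here those are the Korean negative cell (fewer than expected) and the Korean neutral cell (more than expected).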
