# Sentiment Analysis of DaNewsRoom
Sentiment Analysis using DaNLP's [BERT TONE](https://github.com/alexandrainst/danlp/blob/master/docs/docs/tasks/sentiment_analysis.md) for the Cultural Data Science Project 2022 by @drasbaek and @MinaAlmasi

Using the [DaNewsRoom dataset](https://github.com/alexandrainst/danlp/blob/master/docs/docs/tasks/sentiment_analysis.md)

## Import Packages & Mount Google Drive

In [20]:
# import packages for data import
import gzip 
import pandas as pd 
from ast import literal_eval # used when re-importing the csv to turn list columns (stored as strings) back into lists

In [21]:
#progress bar 
!pip -q install tqdm ipywidgets
from tqdm import tqdm
import time

In [22]:
# mount google drive (if run from google colab)
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [23]:
#packages for sentiment analysis
!pip install -q pandas datasets danlp transformers

In [24]:
import matplotlib.pyplot as plt #import for plotting
import matplotlib as mpl #import for plotting high res

In [30]:
#check GPU 
!nvidia-smi -L

GPU 0: A100-SXM4-40GB (UUID: GPU-f4c6471f-7e08-31d8-01bd-9d4bf8a7ca00)


## Data import

In [25]:
#load in the pre-processed dataset 
filepath = "/content/drive/MyDrive/002 cultural-data-science/data/preprocessed_DaNewsRoom.csv"

#read data in chunks of 10,000 rows at a time and combine them into one dataframe
chunks = pd.read_csv(filepath, chunksize=10000)
data = pd.concat(chunks)

ParserError: ignored

In [None]:
data.head()

## Subsetting Data

### Making Year Column

In [None]:
import re 

#define function to extract year from the url in "archive"
def extract_year_from_url(url):
    # use a regular expression to extract the year from the URL
    year_match = re.search(r'web/\d{4}', url)
    if year_match:
        # return year as an integer
        return int(year_match.group()[-4:])
    else:
        # if the year cannot be extracted, return None
        return None

# define function which uses extract_year_from_url to create year column for data
def create_year_column(data):
    # apply the extract_year_from_url function to the 'archive' column and store the result in the 'year' column
    data['year'] = data['archive'].apply(extract_year_from_url)

# use function
create_year_column(data)
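
As a quick check of the extraction logic, the sketch below applies the same regular expression to a single Wayback Machine address. The URL and the `extract_year` name are hypothetical; the pattern mirrors the one used above, with a capture group instead of string slicing.

```python
import re

def extract_year(url):
    # the Wayback Machine timestamp follows "web/" and starts with the 4-digit year
    match = re.search(r"web/(\d{4})", url)
    return int(match.group(1)) if match else None

# hypothetical archive URL in the same format as the 'archive' column
example_url = "https://web.archive.org/web/20150923130641/http://www.example.com/article"
no_timestamp_url = "http://www.example.com/article"

print(extract_year(example_url))       # 2015
print(extract_year(no_timestamp_url))  # None
```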

### Checking the distribution of Year Across Domains

In [None]:
import seaborn as sns

# use seaborn's facetgrid to create a grid of histograms, with one histogram for each domain
g = sns.FacetGrid(data, col="domain", col_wrap=3, sharex=False, sharey=False)

# for each domain, create a histogram with one bin per distinct year (nunique ignores missing years)
bins = data['year'].nunique()
g.map(plt.hist, "year", bins=bins, color="#FF00B3")

# adjust layout and show plot
g.set_titles("{col_name}")
g.set(xlim=(min(data['year']), max(data['year'])))
g.set_ylabels("Frequency")
g.set_xlabels("Year")
plt.show()

The plot above shows that articles are not evenly distributed across years for the different domains. We therefore sample 10000 rows per domain from the same time period. Berlingske's articles seem to be concentrated in a very specific time period, and Berlingske is also the general outlier in terms of political bias. For these reasons, we remove it.

### Removing Berlingske

In [None]:
data = data[data.domain != "berlingske"]

### Subsetting 10000 rows from each domain in a fixed time period

In [None]:
# Create a new data frame with 10000 randomly selected rows per domain, using only articles from after 2014
subset_data = pd.concat([
    data[(data['domain'] == domain) & (data['year'] > 2014)].sample(10000, random_state=42)
    for domain in data['domain'].unique()
])

# Shuffle the rows of the new data frame
subset_data = subset_data.sample(frac=1, random_state=42).reset_index(drop=True)
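
`DataFrame.sample(10000)` raises a `ValueError` when a domain has fewer than 10000 rows left after the year filter. The sketch below, on hypothetical toy data with a smaller sample size, checks the group sizes first and then draws a balanced sample with `groupby(...).sample`. It is an alternative formulation under those assumptions, not the code used for the results in this notebook.

```python
import pandas as pd

# hypothetical toy data standing in for the full dataset
toy = pd.DataFrame({
    "domain": ["dr"] * 4 + ["bt"] * 3 + ["tv2"] * 5,
    "year": [2015, 2016, 2013, 2017, 2015, 2015, 2016, 2014, 2015, 2016, 2017, 2018],
})

n_per_domain = 2  # the notebook uses 10000

# keep only the fixed time period, then check every domain has enough rows
recent = toy[toy["year"] > 2014]
counts = recent["domain"].value_counts()
too_small = counts[counts < n_per_domain]
if not too_small.empty:
    raise ValueError(f"Not enough rows to sample from: {too_small.to_dict()}")

# draw the same number of rows from every domain, then shuffle
balanced = (
    recent.groupby("domain")
    .sample(n=n_per_domain, random_state=42)
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
print(balanced["domain"].value_counts())
```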

### Checking Year Distribution after Subsetting

In [None]:
# use seaborn's facetgrid to create a grid of histograms, with one histogram for each domain
g = sns.FacetGrid(subset_data, col="domain", col_wrap=3, sharex=False, sharey=False)

# for each domain, create a histogram with one bin per distinct year (nunique ignores missing years)
bins = subset_data['year'].nunique()
g.map(plt.hist, "year", bins=bins, color="#c71f1f")

# adjust layout and show plot
g.set_titles("{col_name}")
g.set(xlim=(min(subset_data['year']), max(subset_data['year'])))
g.set_ylabels("Frequency")
g.set_xlabels("Year")
plt.show()

BT's year distribution is still a concern, but since BT is not an outlier in terms of political bias like Berlingske, it is kept in the analysis.

## BERT Tone Classification

Sentiment analysis using DaNLP's pretrained BERT Tone model:
https://github.com/alexandrainst/danlp/blob/master/docs/docs/tasks/sentiment_analysis.md

### Loading BERT Tone Model

In [None]:
from danlp.models import load_bert_tone_model

#load model 
classifier = load_bert_tone_model()

### Using BERT for Sentiment Analysis

In [None]:
# using the classifier to get predictions (tone and sentiment)
predictions = [] #define empty list to be appended to in for loop

for title in tqdm(subset_data["title"]):
    predictions.append(classifier.predict(title))

In [None]:
# using the classifier to get probabilities for the categorisations
probabilities = [] 

for title in tqdm(subset_data["title"]):
    probabilities.append(classifier.predict_proba(title))

In [None]:
# checking the classes
classifier._classes()

In [None]:
#make probabilities into separate columns: each entry holds two arrays (three polarity values, two analytic values) that are joined into one row of five columns
probabilities_data = pd.DataFrame(
    [array[0].tolist() + array[1].tolist() for array in probabilities],
    columns=['positive_probability', 'neutral_probability', 'negative_probability', 'objective_probability', 'subjective_probability']
)

# convert BERT sentiment predictions into dataframe
predictions_data = pd.DataFrame(predictions)

#rename polarity into "sentiment"
predictions_data = predictions_data.rename(columns={"polarity": "sentiment"})

#combine all into final dataframe
subset_data = pd.concat([subset_data, predictions_data, probabilities_data], axis=1)
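
To make the reshaping step concrete, the sketch below runs it on hand-written values for two titles instead of real model output. The numbers are hypothetical, and the label order (positive, neutral, negative, then objective, subjective) is assumed from the column names used above; it should be checked against the output of `classifier._classes()`.

```python
import numpy as np
import pandas as pd

# hypothetical outputs for two titles, shaped like the results of
# classifier.predict and classifier.predict_proba in the cells above
predictions = [
    {"analytic": "objective", "polarity": "neutral"},
    {"analytic": "subjective", "polarity": "negative"},
]
probabilities = [
    [np.array([0.01, 0.98, 0.01]), np.array([0.99, 0.01])],
    [np.array([0.05, 0.15, 0.80]), np.array([0.10, 0.90])],
]

# assumed label order; verify against classifier._classes()
polarity_labels = ["positive", "neutral", "negative"]
analytic_labels = ["objective", "subjective"]
columns = [f"{label}_probability" for label in polarity_labels + analytic_labels]

# join the polarity and analytic arrays of each title into one row of five values
probabilities_data = pd.DataFrame(
    [np.concatenate(pair) for pair in probabilities], columns=columns
)
predictions_data = pd.DataFrame(predictions).rename(columns={"polarity": "sentiment"})

result = pd.concat([predictions_data, probabilities_data], axis=1)
print(result)
```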

In [27]:
subset_data.head()

Unnamed: 0.1,Unnamed: 0,url,archive,title,date,text,summary,density,coverage,compression,...,tokens,tokens_without_sw,year,analytic,sentiment,positive_probability,neutral_probability,negative_probability,objective_probability,subjective_probability
0,98480,http://www.dr.dk/nyheder/indland/rigspolitiet-...,https://web.archive.org/web/20150923130641/htt...,Rigspolitiet: Nu er 10.000 flygtninge rejst in...,1970-08-22 05:28:43.130641,"Strømmen af flygtninge og migranter, der rejse...","Strømmen af flygtninge og migranter, der rejse...",17.0,1.0,11.764706,...,"['Strømmen', 'af', 'flygtninge', 'og', 'migran...","['Strømmen', 'flygtninge', 'migranter', 'rejse...",2015,objective,neutral,0.000944,0.982416,0.016641,0.99993,7e-05
1,30752,http://www.bt.dk/ishockey/her-er-de-hold-danma...,https://web.archive.org/web/20150113085846/htt...,"Her er de hold, Danmarks U20-helte skal møde i...",1970-08-22 05:15:13.085846,"Alt tyder på en ny skæbnekamp mod Schweiz, når...","Alt tyder på en ny skæbnekamp mod Schweiz, når...",22.0,1.0,10.227273,...,"['Alt', 'tyder', 'på', 'en', 'ny', 'skæbnekamp...","['tyder', 'skæbnekamp', 'Schweiz', 'U20-VM', '...",2015,objective,neutral,0.002789,0.996778,0.000434,0.999905,9.5e-05
2,240776,http://www.bt.dk/politik/flertal-i-aarhus-bakk...,https://web.archive.org/web/20150618221344/htt...,Flertal i Aarhus bakker op om lufthavns-planer,1970-08-22 05:23:38.221344,Ved onsdagens byrådsmøde i Aarhus Kommune var ...,På et byrådsmøde i Aarhus Kommune onsdag aften...,4.882353,0.941176,7.294118,...,"['Ved', 'onsdagens', 'byrådsmøde', 'i', 'Aarhu...","['onsdagens', 'byrådsmøde', 'Aarhus', 'Kommune...",2015,objective,neutral,0.014782,0.984837,0.000381,0.999597,0.000403
3,366601,http://nyheder.tv2.dk/erhverv/2017-01-10-forst...,https://web.archive.org/web/20170501113640/htt...,Første spadestik tages i Københavns Lufthavne:...,1970-08-22 10:55:01.113640,Håndværkere må på natarbejde for at mindske ge...,Håndværkere må på natarbejde for at mindske ge...,19.0,1.0,34.315789,...,"['Håndværkere', 'må', 'på', 'natarbejde', 'for...","['Håndværkere', 'natarbejde', 'mindske', 'gene...",2017,subjective,neutral,0.021429,0.976693,0.001878,7.1e-05,0.999929
4,520096,http://www.dr.dk/nyheder/politik/valg2015/graf...,https://web.archive.org/web/20151024165830/htt...,GRAFIK Se hvor der mangler praktiserende læger,1970-08-22 05:30:24.165830,Det er svært at lokke praktiserende læger til ...,Det er svært at lokke praktiserende læger til ...,20.0,1.0,17.85,...,"['Det', 'er', 'svært', 'at', 'lokke', 'praktis...","['svært', 'lokke', 'praktiserende', 'læger', '...",2015,objective,neutral,0.002479,0.914683,0.082838,0.99991,9e-05


## Initial Plotting to Look at BERT's Predictions

### Plotting Sentiment Across Domains

In [None]:
# group sentiment per domain
sentiment_per_domain = subset_data.groupby(["domain", "sentiment"])["sentiment"].count()
sentiment_per_domain

In [None]:
# make sentiment values (positive, neutral, negative) into columns
sentiment_per_domain = sentiment_per_domain.unstack()
sentiment_per_domain

In [None]:
# convert into dataframe
sentiment_per_domain = pd.DataFrame(sentiment_per_domain)

Prepare Data for Stacked Barplot:

In [None]:
# create sum column
sentiment_per_domain["sum"] = sentiment_per_domain["negative"] + sentiment_per_domain["neutral"] + sentiment_per_domain["positive"]

# create proportion column
sentiment_per_domain["proportion_negative"] = sentiment_per_domain["negative"]/sentiment_per_domain["sum"]
sentiment_per_domain["proportion_neutral"] = sentiment_per_domain["neutral"]/sentiment_per_domain["sum"]
sentiment_per_domain["proportion_positive"] = sentiment_per_domain["positive"]/sentiment_per_domain["sum"]

# select only proportion columns for plot 
grouped = sentiment_per_domain[["proportion_negative", "proportion_neutral","proportion_positive"]]
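
The same proportions can also be computed in one step with `pd.crosstab`. The sketch below uses hypothetical toy data; the column names match the ones in `subset_data`, but the values are made up. One difference worth noting: `crosstab` fills a missing domain and sentiment combination with 0, whereas the `groupby` and `unstack` route leaves it as `NaN`.

```python
import pandas as pd

# hypothetical toy data with the same column names as subset_data
toy = pd.DataFrame({
    "domain": ["dr", "dr", "dr", "dr", "bt", "bt"],
    "sentiment": ["neutral", "neutral", "negative", "positive", "neutral", "negative"],
})

# normalize="index" divides each count by its row (domain) total
proportions = pd.crosstab(toy["domain"], toy["sentiment"], normalize="index")
print(proportions)
```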

Plot:

In [None]:
# define plot resolution
mpl.rcParams['figure.dpi'] = 150

# plot values (figsize is passed to plot, since DataFrame.plot creates its own figure)
grouped.plot(kind="bar", stacked=True, figsize=(5,4), color=["#ff7678", "#d7d7d7", "#a0ff9f"])
plt.legend(bbox_to_anchor=(1.02, 0.6), loc="upper left", borderaxespad=0, 
           labels=["Negative", "Neutral", "Positive"], 
           title="Sentiment")

### Plotting Analytic (Tone) Across Domains

In [None]:
#group analytic by domain
analytic_per_domain = subset_data.groupby(["domain", "analytic"])["analytic"].count()

# make objective and subjective into columns
analytic_per_domain = analytic_per_domain.unstack() 

# convert into pandas dataframe 
analytic_per_domain = pd.DataFrame(analytic_per_domain)

Prepare Data for Stacked Barplot:

In [None]:
# create sum column
analytic_per_domain["sum"] = analytic_per_domain["objective"] + analytic_per_domain["subjective"] 

# create proportion column
analytic_per_domain["proportion_subjective"] = analytic_per_domain["subjective"]/analytic_per_domain["sum"]
analytic_per_domain["proportion_objective"] = analytic_per_domain["objective"]/analytic_per_domain["sum"]

# select only proportion columns for plot 
grouped_analytic = analytic_per_domain[["proportion_subjective", "proportion_objective"]]

Plot:

In [None]:
#define resolution of plot
mpl.rcParams['figure.dpi'] = 150

#plot values (figsize is passed to plot, since DataFrame.plot creates its own figure)
grouped_analytic.plot(kind="bar", stacked=True, figsize=(5,4), color=["#FF00B3", "lightblue"])
plt.legend(bbox_to_anchor=(1.02, 0.6), loc="upper left", borderaxespad=0, 
           labels=["Subjective", "Objective"])

## Save BERT Data

In [26]:
# index=False avoids writing the row index as an extra unnamed column
subset_data.to_csv("danews-sentiment-data-V2.csv", index=False)
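
When the saved file is read back, list columns such as `tokens` are stored as strings. The sketch below shows one way to restore them with the `literal_eval` function imported at the top of the notebook. It uses an in-memory buffer and made-up rows instead of the real file, so the data and the `toy` name are assumptions.

```python
import io
from ast import literal_eval
import pandas as pd

# hypothetical rows with a list column, similar to 'tokens' in the dataset
toy = pd.DataFrame({
    "title": ["A", "B"],
    "tokens": [["Alt", "tyder"], ["Ved", "onsdagens"]],
})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# without a converter the lists come back as strings such as "['Alt', 'tyder']"
reloaded = pd.read_csv(buffer, converters={"tokens": literal_eval})
print(type(reloaded["tokens"][0]))
```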