# Sentiment Analysis of DaNewsRoom
Sentiment Analysis using DaNLP's [BERT TONE](https://github.com/alexandrainst/danlp/blob/master/docs/docs/tasks/sentiment_analysis.md) for the Cultural Data Science Project 2022 by @drasbaek and @MinaAlmasi

Using the [DaNewsRoom dataset](https://github.com/danielvarab/da-newsroom).

## Import Packages & Mount Google Drive

In [None]:
# import packages for data import
import gzip 
import pandas as pd 

In [None]:
#progress bar 
!pip -q install tqdm ipywidgets
from tqdm import tqdm
import time

In [None]:
# mount google drive (if run from google colab)
from google.colab import drive
drive.mount("/content/drive")
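
The mount cell above only works inside Google Colab. A minimal sketch of how the mount could be made conditional, so the same notebook also runs locally, is shown below. The helper name `running_in_colab` is hypothetical and not part of the original notebook.

```python
import importlib.util


def running_in_colab() -> bool:
    """Return True when the google.colab package can be located (hypothetical helper)."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ModuleNotFoundError:
        # raised when the parent package `google` is not installed at all
        return False


if running_in_colab():
    # only import and mount when the Colab runtime is available
    from google.colab import drive
    drive.mount("/content/drive")
else:
    print("Not running in Colab; skipping Drive mount.")
```

When running locally, `filepath` further down would then need to point at a local copy of the data.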

In [None]:
#packages for sentiment analysis
!pip install -q pandas datasets danlp transformers

In [None]:
import matplotlib.pyplot as plt #import for plotting
import matplotlib as mpl #import for plotting high res

In [None]:
#check GPU 
!nvidia-smi -L

## Data import

In [None]:
#load in the pre-processed dataset 
filepath = "/content/drive/MyDrive/002 cultural-data-science/data/preprocessed_DaNewsRoom.csv"

#read data in chunks of 10,000 rows at a time 
chunk = pd.read_csv(filepath, chunksize=10000)
data = pd.concat(chunk)
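
To illustrate what chunked reading does, here is a small sketch on made-up data. The in-memory CSV and its rows are invented stand-ins for `preprocessed_DaNewsRoom.csv`; only the `title` and `domain` column names are taken from how the notebook uses the data later on.

```python
import io
import pandas as pd

# tiny in-memory CSV standing in for the real file (made-up rows)
csv_text = "title,domain\n" + "\n".join(
    f"headline {i},site{i % 2}.dk" for i in range(10)
)

# with chunksize set, read_csv returns an iterator of DataFrames
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

chunks = []
for chunk in reader:
    # each chunk could be filtered or trimmed here before it is kept
    chunks.append(chunk[["title", "domain"]])

# ignore_index=True guarantees a 0..n-1 index even if rows were filtered out
demo = pd.concat(chunks, ignore_index=True)
print(len(chunks), demo.shape)
```

Processing each chunk inside the loop is what keeps memory use down; concatenating everything unchanged, as above in the notebook, still ends up holding the full dataset in memory.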

In [None]:
data.head()

## BERT TONE CLASSIFICATION

Sentiment analysis using DaNLP's pretrained BERT model:
https://github.com/alexandrainst/danlp/blob/master/docs/docs/tasks/sentiment_analysis.md

### Loading BERT Tone Model

In [None]:
from danlp.models import load_bert_tone_model

#load model 
classifier = load_bert_tone_model()

### Using BERT for Sentiment Analysis

In [None]:
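# NOTE: `subset_data` is not defined earlier in this notebook; as an assumption,
# the full dataset is used here (replace this with your own subset if needed)
subset_data = data.reset_index(drop=True)

# using the classifier to get predictions (tone and sentiment)
predictions = [] #define empty list to be appended to in for loop

for title in tqdm(subset_data["title"]):
  predictions.append(classifier.predict(title))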

In [None]:
# using the classifier to get probabilities for the categorisations
probabilities = [] 

for title in tqdm(subset_data["title"]):
  probabilities.append(classifier.predict_proba(title))
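
The two loops above could also be written as a single pass, which keeps labels and probabilities aligned by construction. The sketch below uses a stand-in classifier so it runs without downloading the model. `StubToneClassifier` and `classify_titles` are hypothetical names, and the stub's return shapes (a dict with `analytic` and `polarity` keys, and a pair of arrays with three and two values) are inferred from how this notebook uses the results, not taken from the DaNLP source.

```python
import numpy as np


class StubToneClassifier:
    """Hypothetical stand-in for the DaNLP model; always returns the same answer."""

    def predict(self, text):
        return {"analytic": "objective", "polarity": "neutral"}

    def predict_proba(self, text):
        # assumed order: three polarity probabilities, then two analytic probabilities
        return [np.array([0.1, 0.8, 0.1]), np.array([0.9, 0.1])]


def classify_titles(titles, clf):
    """Collect the predicted labels and the probabilities for each title in one loop."""
    preds, probs = [], []
    for title in titles:
        preds.append(clf.predict(title))
        probs.append(clf.predict_proba(title))
    return preds, probs


demo_preds, demo_probs = classify_titles(
    ["Overskrift et", "Overskrift to"], StubToneClassifier()
)
print(demo_preds[0], len(demo_probs))
```

Note that this still calls the model twice per title; it only simplifies the bookkeeping. In the notebook, `classifier` and `tqdm(subset_data["title"])` would take the place of the stub and the example list.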

In [None]:
# checking the classes
classifier._classes()

In [None]:
import numpy as np

#make probabilities into separate columns: each entry holds two arrays (three polarity values and two analytic values), which are flattened into five columns
probability_columns = ["positive_probability", "neutral_probability", "negative_probability", "objective_probability", "subjective_probability"]
probabilities_data = pd.DataFrame([array[0].tolist() + array[1].tolist() for array in probabilities], columns=probability_columns)

# convert BERT sentiment predictions into dataframe
predictions_data = pd.DataFrame(predictions)

#rename polarity into "sentiment"
predictions_data = predictions_data.rename(columns={"polarity": "sentiment"})

#combine all into final dataframe
subset_data = pd.concat([subset_data, predictions_data, probabilities_data], axis=1)
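
As a small check on the flattening step, the sketch below builds the five probability columns from made-up values and verifies that each group sums to roughly 1. The numbers are invented for illustration, and the column order follows the one used in the cell above.

```python
import numpy as np
import pandas as pd

# made-up output for two titles, mimicking the structure used above:
# one array of three polarity probabilities, one array of two analytic probabilities
fake_probabilities = [
    [np.array([0.7, 0.2, 0.1]), np.array([0.4, 0.6])],
    [np.array([0.1, 0.3, 0.6]), np.array([0.9, 0.1])],
]

columns = [
    "positive_probability", "neutral_probability", "negative_probability",
    "objective_probability", "subjective_probability",
]

# join each pair of arrays into one row of five values
rows = [np.concatenate(pair) for pair in fake_probabilities]
demo_probabilities = pd.DataFrame(rows, columns=columns)

# sanity check: the polarity columns and the analytic columns should each sum to ~1
polarity_ok = np.allclose(demo_probabilities.iloc[:, :3].sum(axis=1), 1.0)
analytic_ok = np.allclose(demo_probabilities.iloc[:, 3:].sum(axis=1), 1.0)
print(polarity_ok, analytic_ok)
```

A failed check on real output would suggest that the assumed column order or grouping does not match what the model returns.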

In [None]:
subset_data.head()

## Initial Plotting to Look at BERT's Predictions

### Plotting Sentiment Across Domains

In [None]:
# group sentiment per domain
sentiment_per_domain = subset_data.groupby(["domain", "sentiment"])["sentiment"].count()
sentiment_per_domain

In [None]:
# make sentiment values (positive, neutral, negative) into columns
sentiment_per_domain = sentiment_per_domain.unstack()
sentiment_per_domain

In [None]:
# convert into dataframe
sentiment_per_domain = pd.DataFrame(sentiment_per_domain)

Prepare Data for Stacked Barplot:

In [None]:
# create sum column
sentiment_per_domain["sum"] = sentiment_per_domain["negative"] + sentiment_per_domain["neutral"] + sentiment_per_domain["positive"]

# create proportion column
sentiment_per_domain["proportion_negative"] = sentiment_per_domain["negative"]/sentiment_per_domain["sum"]
sentiment_per_domain["proportion_neutral"] = sentiment_per_domain["neutral"]/sentiment_per_domain["sum"]
sentiment_per_domain["proportion_positive"] = sentiment_per_domain["positive"]/sentiment_per_domain["sum"]

# select only proportion columns for plot 
grouped = sentiment_per_domain[["proportion_negative", "proportion_neutral","proportion_positive"]]
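
The same proportions can be computed more compactly with `pd.crosstab`, sketched here on made-up data. One difference worth noting: with the `groupby`/`unstack` approach, a domain that lacks one of the three labels gets `NaN` in that column, and so in its sum, whereas `crosstab` fills missing combinations with 0.

```python
import pandas as pd

# made-up labels for two domains; b.dk has no negative or positive headlines
demo = pd.DataFrame({
    "domain": ["a.dk", "a.dk", "a.dk", "a.dk", "b.dk", "b.dk"],
    "sentiment": ["negative", "neutral", "neutral", "positive", "neutral", "neutral"],
})

# normalize="index" divides each row by its row total, giving proportions per domain
proportions = pd.crosstab(demo["domain"], demo["sentiment"], normalize="index")

# make sure all three columns exist, in a fixed order, even if a label never occurs
proportions = proportions.reindex(
    columns=["negative", "neutral", "positive"], fill_value=0.0
)
print(proportions)
```

In the notebook, `subset_data["domain"]` and `subset_data["sentiment"]` would take the place of the demo columns.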

Plot:

In [None]:
# define plot resolution
mpl.rcParams['figure.dpi'] = 150

# plot values (figsize is passed to .plot(), which creates its own figure)
grouped.plot(kind="bar", stacked=True, figsize=(5,4), color=["#ff7678", "#d7d7d7", "#a0ff9f"])
plt.legend(bbox_to_anchor=(1.02, 0.6), loc="upper left", borderaxespad=0, 
           labels=["Negative", "Neutral", "Positive"], 
           title = "Sentiment")
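
For reference, a self-contained sketch of the same stacked bar plot on made-up proportions, using the `Axes` object returned by `DataFrame.plot` to set labels and save the figure. The data, the y-axis label, and the output filename are invented for illustration. The `Agg` backend line is only there so the sketch runs as a plain script; it should be dropped inside a notebook.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # off-screen backend; remove this line in a notebook
import pandas as pd

# made-up proportions for two domains
demo = pd.DataFrame(
    {
        "proportion_negative": [0.25, 0.10],
        "proportion_neutral": [0.50, 0.70],
        "proportion_positive": [0.25, 0.20],
    },
    index=["a.dk", "b.dk"],
)

# DataFrame.plot creates the figure itself and returns the Axes it drew on
ax = demo.plot(kind="bar", stacked=True, figsize=(5, 4),
               color=["#ff7678", "#d7d7d7", "#a0ff9f"])
ax.set_ylabel("Proportion of headlines")
ax.legend(labels=["Negative", "Neutral", "Positive"], title="Sentiment",
          bbox_to_anchor=(1.02, 0.6), loc="upper left", borderaxespad=0)

# bbox_inches="tight" keeps the legend, which sits outside the axes, in the saved image
out_path = os.path.join(tempfile.gettempdir(), "sentiment_per_domain_demo.png")
ax.figure.savefig(out_path, dpi=150, bbox_inches="tight")
```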

### Plotting Analytic (Tone) Across Domains

In [None]:
#group analytic by domain
analytic_per_domain = subset_data.groupby(["domain", "analytic"])["analytic"].count()

# make objective and subjective into columns
analytic_per_domain = analytic_per_domain.unstack() 

# convert into pandas dataframe 
analytic_per_domain = pd.DataFrame(analytic_per_domain)

Prepare Data for Stacked Barplot:

In [None]:
# create sum column
analytic_per_domain["sum"] = analytic_per_domain["objective"] + analytic_per_domain["subjective"] 

# create proportion column
analytic_per_domain["proportion_subjective"] = analytic_per_domain["subjective"]/analytic_per_domain["sum"]
analytic_per_domain["proportion_objective"] = analytic_per_domain["objective"]/analytic_per_domain["sum"]

# select only proportion columns for plot 
grouped_analytic = analytic_per_domain[["proportion_subjective", "proportion_objective"]]

Plot:

In [None]:
#define resolution of plot
mpl.rcParams['figure.dpi'] = 150

#plot values (figsize is passed to .plot(), which creates its own figure)
grouped_analytic.plot(kind="bar", stacked=True, figsize=(5,4), color=["#FF00B3", "lightblue"])
plt.legend(bbox_to_anchor=(1.02, 0.6), loc="upper left", borderaxespad=0, 
           labels=["Subjective", "Objective"])

## Save BERT Data

In [None]:
subset_data.to_csv("danews-sentiment-data-V2.csv")
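
By default, `to_csv` also writes the DataFrame index, which shows up as an `Unnamed: 0` column when the file is read back. A minimal sketch of saving without the index and reloading, using a made-up one-row frame and a hypothetical temporary filename:

```python
import os
import tempfile

import pandas as pd

# made-up one-row frame standing in for subset_data
demo = pd.DataFrame({"title": ["Overskrift"], "sentiment": ["neutral"]})

# hypothetical output location in the system temp directory
path = os.path.join(tempfile.gettempdir(), "danews-sentiment-demo.csv")

# index=False leaves the row index out of the file
demo.to_csv(path, index=False)

# reading the file back gives the same columns, with no extra index column
reloaded = pd.read_csv(path)
print(list(reloaded.columns))
```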