# Sentiment Analysis Transformer (XLM-RoBERTa Base)



This notebook is part of the deliverables of the MSc honours programme of Maastricht University in 2022.

### Project
In the framework of this project, we, the five students listed below who make up Team UniCon, spent five months consulting Capgemini Invent on Diversity and Inclusion in Management Consulting. For this purpose, our team produced materials, including a guide, on gender norms in employee communication for Capgemini Invent's use. Our goal is to enhance diversity by practicing more inclusive communication, for the benefit of employers as much as employees.

### Team

**Author and technical lead**: [Justus-Jonas Erker](https://erker.ai)

**Non-technical team members**:
- Charlotte Clerx 
- Johann Ferreira  
- Talea Grootenhuis
- Benjamin Huxoll


In cooperation with **Capgemini Invent** (represented by Miriam Cramer and Jarno Baumgartner)


#### Note
The 

In [None]:
# mount google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# install packages
!pip install transformers
!pip install whatstk
!pip install sentencepiece

In [None]:
# Import required packages (duplicate and unused TensorFlow imports removed)
import torch
import pandas as pd
import numpy as np
from torch.utils.data import Dataset, DataLoader, TensorDataset, RandomSampler, SequentialSampler
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer, Trainer
from scipy.special import softmax
from tqdm.notebook import tqdm

## Load Model
Now we can load the multilingual sentiment analysis transformer [cardiffnlp/twitter-xlm-roberta-base-sentiment](https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment).

In [None]:
# Replace every link in the text with the placeholder 'http'
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)

# PyTorch model
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
#model.save_pretrained(MODEL)


text = ["Good night 😊", "Bad night"]
#text = preprocess(text)
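# padding=True is needed to batch texts with different token lengths
encoded_input = tokenizer(text, return_tensors='pt', padding=True)
output = model(**encoded_input)
# logits of the first text only ("Good night 😊")
scores = output[0][0].detach().numpy()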
scores = softmax(scores)

trainer = Trainer(model=model)

# Print labels and scores
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = config.id2label[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")

1) Positive 0.7673
2) Neutral 0.2015
3) Negative 0.0313
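
As a minimal sketch of what the score computation above does, the following block turns raw logits into probabilities with a numerically stable softmax and ranks the labels from most to least likely. The logits and the `id2label` mapping are made-up illustrative values (assumptions), not output of the model.

```python
import numpy as np

def softmax_scores(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)  # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical logits and label mapping, for illustration only
logits = np.array([-1.0, 0.5, 2.0])
id2label = {0: "Negative", 1: "Neutral", 2: "Positive"}

scores = softmax_scores(logits)
ranking = np.argsort(scores)[::-1]  # label indices from highest to lowest score
for rank, idx in enumerate(ranking, start=1):
    print(f"{rank}) {id2label[int(idx)]} {scores[idx]:.4f}")
```

The scores sum to one, so they can be read as the model's confidence in each of the three labels.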


In [None]:
# load model to GPU 
_ = model.to('cuda')

In [None]:
def calculate_scores(texts):
  """
  texts: list of all texts in dataframe
  return: three lists with the confidence scores for positive, neutral and
          negative (in that order), one entry per text
  """
  confidence_dict = {
      'Positive': [],
      'Neutral': [],
      'Negative': []
  }

  for text in tqdm(texts):
    # truncate long messages so they never exceed the model's input length
    encoded_input = tokenizer(preprocess(text), return_tensors='pt',
                              truncation=True, max_length=512).to('cuda')

    # inference only, so no gradients need to be tracked
    with torch.no_grad():
      output = model(**encoded_input)

    score = output[0][0].cpu().numpy()
    score = softmax(score)
    # store the score of every label for this text
    for i in range(score.shape[0]):
        l = config.id2label[i]
        confidence_dict[l].append(np.round(float(score[i]), 4))


  print(len(confidence_dict['Positive']))
  return confidence_dict['Positive'], confidence_dict['Neutral'], confidence_dict['Negative']
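
Before tokenization, `calculate_scores` passes each message through `preprocess`, which replaces every token starting with `http` by the literal placeholder `http`. The sketch below repeats that function from the notebook and applies it to a made-up message; the example URL is an assumption for illustration.

```python
def preprocess(text):
    """Replace every whitespace-separated token that starts with 'http' by 'http'."""
    new_text = []
    for t in text.split(" "):
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# Hypothetical message containing a link
message = "check https://example.com for details"
cleaned = preprocess(message)
print(cleaned)  # check http for details
```

Note that the function expects a single string, so a list of messages has to be processed one element at a time.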

## Loading Data and Calculating Sentiment Scores
Now we can load the group chat as well as the individual chats. We used WhatsApp for internal communication and therefore use the ```whatstk``` package to load the chats. We still have to filter out WhatsApp-specific strings in the chats, such as encryption notices and attachment placeholders.

We calculate the sentiment scores for every chat and store them in the loaded dataframe. Each dataframe is linked to its participants through a dictionary, which we will use later to group the messages by sender and chat partner.
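
The filtering step described above can be sketched on a toy dataframe. The column names follow the `username` and `message` columns used in this notebook; the messages and the shortened filter lists are made up for illustration.

```python
import pandas as pd

# Hypothetical toy messages, for illustration only
toy = pd.DataFrame({
    "username": ["A", "B", "A", "B"],
    "message": ["See you tomorrow!", "image omitted",
                "<attached: photo.jpg>", "Sounds good"],
})

exact_filter = ["image omitted", "audio omitted"]  # dropped on exact match
prefix_filter = ["<attached:"]                     # dropped on prefix match

# Build one boolean mask that keeps only real messages
mask = ~toy["message"].isin(exact_filter)
for prefix in prefix_filter:
    mask &= ~toy["message"].str.startswith(prefix)

clean = toy[mask].reset_index(drop=True)
print(clean)
```

Only the two real messages remain; the placeholder rows would otherwise be scored by the model and distort the averages.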

In [None]:
import glob
import numpy as np
from whatstk.whatsapp.objects import WhatsAppChat
from whatstk.data import whatsapp_urls


# WhatsApp system messages and media placeholders, removed on exact match
a = [
     '‎Nachrichten und Anrufe sind Ende-zu-Ende-verschlüsselt. Niemand außerhalb dieses Chats kann sie lesen oder anhören, nicht einmal WhatsApp.',
      '‎Messages and calls are end-to-end encrypted. No one outside of this chat, not even WhatsApp, can read or listen to them.',
     '<Medien ausgeschlossen>',
     'image omitted',
     'audio omitted',
     'Bild weggelassen'
]


# messages starting with one of these prefixes are attachments and are removed
starts_with_filter= [
                     '‎<attached:'               
]

people = ['B', 'C', 'T', 'E', 'J']
B = ['Benni' ]
E = ['Justus-Jonas Erker']
C = ['Charlotte Clerx', 'Charlotte  Clerx (Premium)']
T = ['Talea  Grootenhuis (Premium)', 'Talea Grootenhuis']
J = ['Johann', 'Johann Ferreira', '\u202a+31\xa06\xa033907035\u202c', '+27 82 616 2956']

chat = WhatsAppChat.from_source(filepath='/content/drive/MyDrive/Uni/Capgemini Premium Team/Sentiment Analysis/data/WhatsApp Chat mit The Capgemini Team.txt')
df_all = chat.df

df_all['username'] = df_all['username'].replace(to_replace=B,value='B')
df_all['username'] = df_all['username'].replace(to_replace=E,value='E')
df_all['username'] = df_all['username'].replace(to_replace=C,value='C')
df_all['username'] = df_all['username'].replace(to_replace=T,value='T')
df_all['username'] = df_all['username'].replace(to_replace=J,value='J')

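# number of words per message, used below to drop very long messages
df_all['count'] = df_all['message'].str.split().apply(len)
print(max(df_all['count']))

# remove WhatsApp system messages and attachment placeholders
df_all = df_all[~df_all.message.isin(a)]

for x in starts_with_filter:
  df_all = df_all[~df_all['message'].astype(str).str.startswith(x)]
# keep only messages of at most 100 words; copy() avoids SettingWithCopyWarning
df_all = df_all[df_all['count'] <= 100].copy()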

positive, neutral, negative = calculate_scores(df_all['message'].tolist())
df_all['positive'] = positive
df_all['neutral'] = neutral
df_all['negative'] = negative

current_names = np.unique(chat.df.username.to_numpy()).tolist()
print(current_names)
individual_chats = []

# per person: list of (chat index, chat partner) tuples
individual_chats_participants = {
    'B': [],
    'E': [],
    'C': [],
    'T': [],
    'J': [],
}



files = glob.glob('/content/drive/MyDrive/Uni/Capgemini Premium Team/Sentiment Analysis/data/individual chats/*')
names = []

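for i, file in enumerate(tqdm(files)):
  print(file)
  chat = WhatsAppChat.from_source(filepath=file)
  df = chat.df
  
  # remove attachment placeholders and WhatsApp system messages
  for x in starts_with_filter:
    df = df[~df['message'].str.startswith(x)]
  # copy() avoids SettingWithCopyWarning when columns are assigned below
  df = df[~df.message.isin(a)].copy()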
  df['username'] = df['username'].replace(to_replace=B,value='B')
  df['username'] = df['username'].replace(to_replace=E,value='E')
  df['username'] = df['username'].replace(to_replace=C,value='C')
  df['username'] = df['username'].replace(to_replace=T,value='T')
  df['username'] = df['username'].replace(to_replace=J,value='J')

  # calculate scores
  positive, neutral, negative = calculate_scores(df['message'].tolist())
  df['positive'] = positive
  df['neutral'] = neutral
  df['negative'] = negative

  # append dataframe
  individual_chats.append(df)
  current_names = np.unique(df.username.to_numpy()).tolist()
  print(current_names)

  # for each of the two participants, store the chat index and the chat partner
  individual_chats_participants[current_names[0]].append((i,current_names[1]))
  individual_chats_participants[current_names[1]].append((i,current_names[0]))
  
  names.extend(current_names)
np.unique(np.array(names))
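
To illustrate the bookkeeping in the loop above, here is a minimal sketch of how the participants dictionary links each person to the index of a chat and to the chat partner. The chat list and the name `toy_chats` are hypothetical stand-ins for the loaded WhatsApp files.

```python
# Hypothetical chats: each entry lists the two participants of one individual chat
toy_chats = [["B", "C"], ["C", "T"], ["B", "T"]]

participants = {"B": [], "C": [], "T": []}
for i, (first, second) in enumerate(toy_chats):
    # each person stores the chat index together with the chat partner
    participants[first].append((i, second))
    participants[second].append((i, first))

print(participants["C"])  # [(0, 'B'), (1, 'T')]
```

With this structure, all chats of one person can be looked up by index and immediately assigned to a group via the partner's initial.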

In [None]:
# create groups of males and females
males = ['E','B','J']
females = ['T', 'C']

Calculate the mean sentiment scores of each individual's messages, overall and split by the group (female or male) of the chat partner.

In [None]:
individuals = ['E','B','J','T', 'C']


for ind in individuals:
  # baseline: start with the individual's messages in the group chat
  base_frames = [df_all.loc[df_all['username'] == ind]]
  female_frames = []
  male_frames = []
  for index, chat_partner in individual_chats_participants[ind]:
    df = individual_chats[index]
    df = df.loc[df['username'] == ind]
    # sort the individual's messages by the group of the chat partner
    if chat_partner in males:
      male_frames.append(df)
    else:
      female_frames.append(df)
    base_frames.append(df)

  # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
  female_df = pd.concat(female_frames)
  male_df = pd.concat(male_frames)
  base = pd.concat(base_frames)

  female_pos = np.mean(female_df['positive'])
  female_neg = np.mean(female_df['negative'])
  female_neu = np.mean(female_df['neutral'])

  male_pos = np.mean(male_df['positive'])
  male_neg = np.mean(male_df['negative'])
  male_neu = np.mean(male_df['neutral'])

  total_pos = np.mean(base['positive']) 
  total_neg = np.mean(base['negative']) 
  total_neu = np.mean(base['neutral']) 
  
  print(ind)
  
  print(f'Total +: {total_pos}')
  print(f'Total -: {total_neg}')
  print(f'Total /: {total_neu}')

  print(f'female +: {female_pos}')
  print(f'female -: {female_neg}')
  print(f'female /: {female_neu}')

  print(f'male +: {male_pos}')
  print(f'male -: {male_neg}')
  print(f'male /: {male_neu}')

  print('\n \n \n ')
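
The per-partner averaging above can also be expressed with a pandas `groupby`, which is a swapped-in alternative to the explicit loop rather than the method used in this notebook. The sketch below assumes a hypothetical dataframe `msgs` of already scored messages from one sender, with a made-up `partner` column and a `males` list given here as an assumption.

```python
import pandas as pd

# Hypothetical scored messages sent by one person, for illustration only
msgs = pd.DataFrame({
    "partner":  ["T", "T", "B", "J"],
    "positive": [0.6, 0.4, 0.2, 0.4],
    "negative": [0.1, 0.3, 0.5, 0.1],
    "neutral":  [0.3, 0.3, 0.3, 0.5],
})
males = ["E", "B", "J"]  # assumed group assignment

# Label each message with the group of the chat partner, then average per group
msgs["partner_group"] = msgs["partner"].apply(
    lambda p: "male" if p in males else "female"
)
means = msgs.groupby("partner_group")[["positive", "negative", "neutral"]].mean()
print(means)
```

Each row of `means` corresponds to one partner group, which mirrors the `female +`, `female -`, `female /` and `male ...` lines printed by the loop.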








Calculate the mean scores for the communication between the groups (female to female, female to male, male to female, male to male) as well as the baseline of each group.

In [None]:
individuals = ['E','B','J','T', 'C']
males = ['E','B','J']
females = ['T', 'C']

# message frames per (sender group, receiver group) and per-group baselines
pair_frames = {
    ('female', 'female'): [],
    ('female', 'male'): [],
    ('male', 'female'): [],
    ('male', 'male'): [],
}
base_frames = {'female': [], 'male': []}

for ind in individuals:
  sender = 'male' if ind in males else 'female'
  # group-chat messages of this individual count toward the group's baseline
  base_frames[sender].append(df_all.loc[df_all['username'] == ind])

  for index, chat_partner in individual_chats_participants[ind]:
    df = individual_chats[index]
    df = df.loc[df['username'] == ind]
    receiver = 'male' if chat_partner in males else 'female'
    pair_frames[(sender, receiver)].append(df)
    # individual-chat messages also count toward the sender group's baseline
    base_frames[sender].append(df)

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
female_to_female_df = pd.concat(pair_frames[('female', 'female')])
female_to_male_df = pd.concat(pair_frames[('female', 'male')])
male_to_female_df = pd.concat(pair_frames[('male', 'female')])
male_to_male_df = pd.concat(pair_frames[('male', 'male')])
female_base = pd.concat(base_frames['female'])
male_base = pd.concat(base_frames['male'])


female_to_female_pos = np.mean(female_to_female_df['positive'])
female_to_female_neg = np.mean(female_to_female_df['negative'])
female_to_female_neu = np.mean(female_to_female_df['neutral'])

female_to_male_pos = np.mean(female_to_male_df['positive'])
female_to_male_neg = np.mean(female_to_male_df['negative'])
female_to_male_neu = np.mean(female_to_male_df['neutral'])

male_to_female_pos = np.mean(male_to_female_df['positive'])
male_to_female_neg = np.mean(male_to_female_df['negative'])
male_to_female_neu = np.mean(male_to_female_df['neutral'])

male_to_male_pos = np.mean(male_to_male_df['positive'])
male_to_male_neg = np.mean(male_to_male_df['negative'])
male_to_male_neu = np.mean(male_to_male_df['neutral'])

male_pos = np.mean(male_base['positive'])
male_neg = np.mean(male_base['negative'])
male_neu = np.mean(male_base['neutral'])

female_pos = np.mean(female_base['positive']) 
female_neg = np.mean(female_base['negative']) 
female_neu = np.mean(female_base['neutral']) 
  

print(f'Female to Female +: {female_to_female_pos}')
print(f'Female to Female -: {female_to_female_neg}')
print(f'Female to Female /: {female_to_female_neu}')
print('\n \n')

print(f'Female to male +: {female_to_male_pos}')
print(f'Female to male -: {female_to_male_neg}')
print(f'Female to male /: {female_to_male_neu}')
print('\n \n')

print(f'male to Female +: {male_to_female_pos}')
print(f'male to Female -: {male_to_female_neg}')
print(f'male to Female /: {male_to_female_neu}')
print('\n \n')

print(f'male to male +: {male_to_male_pos}')
print(f'male to male -: {male_to_male_neg}')
print(f'male to male /: {male_to_male_neu}')
print('\n \n')

print(f'female base +: {female_pos}')
print(f'female base -: {female_neg}')
print(f'female base /: {female_neu}')
print('\n \n')

print(f'male base +: {male_pos}')
print(f'male base -: {male_neg}')
print(f'male base /: {male_neu}')

print('\n \n \n ')

Female to Female +: 0.37939367088607595
Female to Female -: 0.2257164556962025
Female to Female /: 0.39488607594936714

 

Female to male +: 0.29350434782608703
Female to male -: 0.30016195652173905
Female to male /: 0.4063282608695652

 

male to Female +: 0.318204716981132
male to Female -: 0.2229707547169812
male to Female /: 0.4588264150943396

 

male to male +: 0.33991111111111116
male to male -: 0.10382222222222223
male to male /: 0.5562888888888887

 

female base +: 0.32034768649669454
female base -: 0.18808432483474977
female base /: 0.4915696883852694

 

male base +: 0.3983480591497228
male base -: 0.16317707948243976
male base /: 0.4384731977818861

 
 
 


Subtract the baseline of the sender's group from each score to see how the communication between the groups deviates from it.

In [None]:
print('with subtracting base')

print(f'Female to Female +: {female_to_female_pos-female_pos}')
print(f'Female to Female /: {female_to_female_neu-female_neu}')
print(f'Female to Female -: {female_to_female_neg-female_neg}')
print('\n \n')

print(f'Female to male +: {female_to_male_pos-female_pos}')
print(f'Female to male /: {female_to_male_neu-female_neu}')
print(f'Female to male -: {female_to_male_neg-female_neg}')
print('\n \n')

print(f'male to Female +: {male_to_female_pos-male_pos}')
print(f'male to Female /: {male_to_female_neu-male_neu}')
print(f'male to Female -: {male_to_female_neg-male_neg}')
print('\n \n')

print(f'male to male +: {male_to_male_pos-male_pos}')
print(f'male to male /: {male_to_male_neu-male_neu}')
print(f'male to male -: {male_to_male_neg-male_neg}')
print('\n \n')


with subtracting base
Female to Female +: 0.05904598438938141
Female to Female /: -0.09668361243590223
Female to Female -: 0.03763213086145273

 

Female to male +: -0.02684333867060751
Female to male /: -0.08524142751570418
Female to male -: 0.11207763168698928

 

male to Female +: -0.08014334216859081
male to Female /: 0.020353217312453487
male to Female -: 0.05979367523454143

 

male to male +: -0.05843694803861166
male to male /: 0.11781569110700263
male to male -: -0.059354857260217525
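
As a worked example of the baseline subtraction, the first deviation above can be reproduced from the two mean positive scores printed earlier (female to female and female base). The variable names are chosen here for illustration.

```python
# Mean positive scores copied from the printed output of the previous cell
female_to_female_pos = 0.37939367088607595
female_base_pos = 0.32034768649669454

# A positive deviation means more positive sentiment than the group's baseline
deviation = female_to_female_pos - female_base_pos
print(f"Female to Female +: {deviation:.4f}")  # Female to Female +: 0.0590
```

Read this way, messages from women to women score about 0.06 higher on the positive label than the female baseline.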

 

