# All-The-News Data Set

This notebook imports the All-The-News data set (downloaded from https://tinyurl.com/bx3r3de8), as used in Rochelle's stepmothers paper. The next step will be to fine-tune BERT on the different news sources.

Quote from paper: 'From each news source we take 4354 articles from the All-The-News dataset that contains articles from 27 American Publications collected between 2013 and early 2020. We fine-tune the 5 base models on these news sources using the MLM objective for only 1 training epoch. We use the HuggingFace library (Wolf et al., 2020).'
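
The quoted setup draws the same number of articles from every source. A minimal sketch of that sampling step in plain Python follows, assuming the articles are available as a list of dicts with a `publication` key. The helper name `sample_per_source` and the seed are illustrative assumptions, not taken from the paper.

```python
import random
from collections import defaultdict

def sample_per_source(articles, n_per_source, seed=0):
    """Draw up to n_per_source articles per publication (hypothetical helper)."""
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for article in articles:
        by_source[article["publication"]].append(article)
    sampled = {}
    for source, items in by_source.items():
        # Sources with fewer articles than requested are kept whole.
        k = min(n_per_source, len(items))
        sampled[source] = rng.sample(items, k)
    return sampled

# Toy data standing in for the real CSV rows.
toy = [{"publication": "Reuters", "content": f"r{i}"} for i in range(5)]
toy += [{"publication": "Guardian", "content": f"g{i}"} for i in range(3)]
subset = sample_per_source(toy, n_per_source=4)
```

On a HuggingFace `Dataset`, chaining `shuffle(seed=...)` and `select(range(n))` should give a comparable fixed-size sample.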

In [1]:
%reset -f
%load_ext autoreload
%autoreload 2

In [2]:
import os
import transformers
import datasets
import pandas as pd
import numpy as np
from transformers import AutoTokenizer
from datasets import load_dataset, DatasetDict, dataset_dict
#import datasets

In [3]:
# try with the data downloaded from kaggle
# https://www.kaggle.com/snapcrack/all-the-news/version/4
allthenews = load_dataset('csv', script_version='master', data_files=['../data/external/archive/articles1.csv', '../data/external/archive/articles2.csv', '../data/external/archive/articles3.csv'], 
                          column_names = ['Unnamed', 'id', 'title', 'publication', 'author', 'date', 'year', 'month', 'url', 'content'])

# I get an ArrowInvalid error if I also load the other two csvs and don't specify the column names

Using custom data configuration default-1fabe3ad6ed75c39
Reusing dataset csv (/home/alina/.cache/huggingface/datasets/csv/default-1fabe3ad6ed75c39/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff)


Select only articles of one news source

In [4]:
allthebreitbart = allthenews.filter(lambda example: example['publication']=='Breitbart').remove_columns(['Unnamed', 'title', 'publication', 'author', 'year', 'month', 'url', 'date'])
allthefox = allthenews.filter(lambda example: example['publication']=='Fox News').remove_columns(['Unnamed', 'title', 'publication', 'author', 'year', 'month', 'url', 'date'])
allthereuters = allthenews.filter(lambda example: example['publication']=='Reuters').remove_columns(['Unnamed', 'title', 'publication', 'author', 'year', 'month', 'url', 'date'])
alltheguardian = allthenews.filter(lambda example: example['publication']=='Guardian').remove_columns(['Unnamed', 'title', 'publication', 'author', 'year', 'month', 'url', 'date'])
#allthenewyorker = allthenews.filter(lambda example: example['publication']=='New Yorker').remove_columns(['Unnamed', 'title', 'publication', 'author', 'year', 'month', 'url', 'date'])

Loading cached processed dataset at /home/alina/.cache/huggingface/datasets/csv/default-1fabe3ad6ed75c39/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff/cache-fdf8e1d56953b4fa.arrow
Loading cached processed dataset at /home/alina/.cache/huggingface/datasets/csv/default-1fabe3ad6ed75c39/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff/cache-9bdb7c6a06f3e03f.arrow
Loading cached processed dataset at /home/alina/.cache/huggingface/datasets/csv/default-1fabe3ad6ed75c39/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff/cache-c4684eb61f540615.arrow
Loading cached processed dataset at /home/alina/.cache/huggingface/datasets/csv/default-1fabe3ad6ed75c39/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff/cache-27a90a56f71a1670.arrow


In [5]:
allthenews_dict = {'breitbart': allthebreitbart, 'fox': allthefox, 'reuters': allthereuters, 'guardian': alltheguardian}#, 'newyorker': allthenewyorker}

In [6]:
def train_val_split(dat, validation_percentage=0.1):
    # simple function to split an existing huggingface Dataset object.
    # returns a DatasetDict containing the training and validation datasets
    # default split ratio 90-10
    # note that the validation set is still named 'test', as I am using the datasets train_test_split method
    train_valid = dat.train_test_split(test_size=validation_percentage)
    return train_valid

# adapted from https://discuss.huggingface.co/t/how-to-split-main-dataset-into-train-dev-test-as-datasetdict/1090/2
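
As a sanity check on the split logic, here is a plain-Python sketch of a seeded train/validation split that honours the requested fraction. It works on row indices rather than on a `Dataset`; the name `split_indices` and the seed value are assumptions made for illustration.

```python
import random

def split_indices(n_rows, validation_percentage=0.1, seed=42):
    """Return (train_idx, val_idx) for n_rows examples (illustrative helper)."""
    indices = list(range(n_rows))
    # A fixed seed makes the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_rows * validation_percentage))
    return sorted(indices[n_val:]), sorted(indices[:n_val])

train_idx, val_idx = split_indices(1000, validation_percentage=0.1)
```

`Dataset.train_test_split` also accepts a `seed` argument, which is worth setting if the fine-tuned models should be comparable across reruns.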

In [7]:
newspapers = ['breitbart', 'fox', 'reuters', 'guardian']#, 'newyorker']

In [8]:
allthenews_trainval = {newspaper: train_val_split(dat['train']) for newspaper, dat in allthenews_dict.items()}

Loading cached split indices for dataset at /home/alina/.cache/huggingface/datasets/csv/default-1fabe3ad6ed75c39/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff/cache-c99456e9f66ae51d.arrow and /home/alina/.cache/huggingface/datasets/csv/default-1fabe3ad6ed75c39/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff/cache-1580c791c6a4aea0.arrow


### Identity terms

In [15]:
# gender terms
female_terms = ["girls", "women", "females", "girlfriends", "stepmothers", "ladies", "sisters", "mothers", "grandmothers", "wives", "brides", "schoolgirls", "mommies"]
male_terms = ["men", "males", "boys", "boyfriends", "stepfathers", "gentlemen", "brothers", "fathers", "grandfathers", "husbands", "grooms",  "schoolboys",  "daddies"]

In [16]:
# deciding to add some synonyms, singulars, plurals etc
female_terms = ["she", "her", "girl", "girls", "woman", "women", "female", "females", "girlfriend", "girlfriends", "stepmothers", "lady", "ladies", "sister", "sisters", "mother", "mothers", "grandmothers", "wife", "wives", "bride", "brides", "schoolgirls", "mom", "mum", "moms", "mums", "mummies", "mommies", "miss", "mrs", "ms", "lady", "mistress"]
male_terms = ["he", "his", "him", "boy", "boys", "man", "men", "male", "males", "boyfriend", "boyfriends", "stepfathers", "gentleman", "gentlemen", "brother", "brothers", "father", "fathers", "grandfathers", "husband", "husbands", "groom", "grooms",  "schoolboys", "dad", "dads", "daddy", "daddies", "mr", "sir", "lord"]
neutral_terms = ["they", "their", "them", "child", "person", "people", "parent", "parents", "partner", "partners", "spouse", "sibling", "siblings"]
female_terms_short = ["she", "woman", "girl"]#her", "girl", "girls", "woman", "women", "female"]
male_terms_short = ["he", "man", "boy"]#his", "him", "boy", "boys", "man", "men", "male"]
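
Long hand-written term lists are easy to get subtly wrong: Python concatenates adjacent string literals, so a missing comma silently merges two terms, and repeated entries go unnoticed. The sketch below shows the concatenation behaviour on a toy list and a hypothetical helper, `find_duplicates`, that flags repeated terms.

```python
from collections import Counter

def find_duplicates(terms):
    """Return the terms that occur more than once (illustrative helper)."""
    counts = Counter(terms)
    return sorted(term for term, n in counts.items() if n > 1)

# Adjacent string literals are joined when the comma between them is missing,
# so this list has three elements, not four.
merged = ["mothers", "grandmothers" "wives", "brides"]

# A repeated entry would be counted twice in any frequency analysis.
duplicated = ["lady", "ladies", "sister", "lady"]
repeats = find_duplicates(duplicated)
```

Comparing `len(terms)` against the number of terms you meant to write is a cheap way to spot a merged pair.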

In [17]:
race_terms = list(map(lambda x: x.lower(), ["Asians", "Americans", "Europeans", "Jews", "Indians", "Russians", "Africans", "Black people", "Mexicans", "Whites", "Blacks", "White people", "Germans", "blondes", "blonde girls", "Arabs", "White Americans", "Black Americans", "Hispanics", "Native Americans", "Black men", "White men", "Asian women", "Asian men", "Black women", "the Dutch", "Irish people", "Irish men", "White women", "Indian men", "Indian women", "Chinese men", "Chinese women", "Japanese women", "Japanese men", "Indian parents", "Asian parents", "White parents", "Black parents", "Black fathers", "Latinas", "Latinos", "Latin people", "Brazilian women","Asian kids", "Black kids", "White kids", "African Americans", "Nigerians", "Ethiopians", "Ukrainians", "Sudanese people", "Afghans", "Iraqis", "Hispanic men", "Hispanic women", "Italians", "Italian men", "Italian women", "Somalis", "Iranian people", "Iranians", "Australians", "Australian men", "Australian women", "Aussies", "Ghanaians", "Swedes", "Finns", "Venezuelans", "Moroccans", "Syrians", "Pakistanis", "British people", "French people", "Greeks", "Indonesians", "Vietnamese people", "Romanians", "Ecuadorians", "Norwegians", "Nepalis" , "Scots", "Bengalis", "Polish people", "Taiwanese people", "Albanians", "Colombians", "Egyptians", "Koreans", "Persian people", "Portuguese men", "Portuguese women", "Turkish people", "Austrians", "South Africans", "Dutch people", "Chileans", "Lebanese people"]))
race_terms_short = list(map(lambda x: x.lower(), ["Asian", "American", "European", "Jewish", "Indian", "African", "Black", "Mexican", "White", "Arab"]))
social_gps_paper = list(map(lambda x: x.lower(), ['christian', 'police', 'conservative', 'celebrities', 'gay', 'academics', 'Iraq', 'asian', 'black', 'ladies', 'teenager']))

In [1]:
def load_news(input_dir: str, newspaper: str):
    # load entire dataset, discard columns we don't need
    news = load_dataset('csv', script_version='master', data_files=[input_dir+'/articles1.csv', input_dir+'/articles2.csv', input_dir+'/articles3.csv'], 
                          column_names = ['Unnamed', 'id', 'title', 'publication', 'author', 'date', 'year', 'month', 'url', 'content']).remove_columns(['Unnamed', 'title', 'author', 'year', 'month', 'url', 'date'])
    
    # filter for publication of interest
    news_newspaper = news.filter(lambda example: example['publication']==newspaper).remove_columns(['publication'])
    
    return news_newspaper


### Import Emotion Lexicon

In [73]:
emotionwords = pd.read_excel('../data/external/NRC-Emotion-Lexicon-v0.92-In105Languages-Nov2017Translations.xlsx', usecols="A,DB:DK")
emotions = list(emotionwords.columns[1:])
emotionwords_dict = {emotion: list(emotionwords['English (en)'].loc[emotionwords[emotion]==1]) for emotion in emotions}
# cast series to list. change back if this breaks things
del emotionwords

#### Co-occurrences: How often is eg. woman mentioned close to eg. angry?

In [68]:
# option 1. text module: search concordances for words in emotion lexicon with re matching then Counter()
# 1b find concordances, take those as input to construct a vocab and extract counts
# option 2. lm module: count ngrams that contain a specific word
# option 3: do this: take bigram collocation finder, start with window size 2 (adjacent words) 
# and count the occurrence of emotion words among the results
# then vary the window size
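
A rough plain-Python sketch of option 3, assuming a simple regex tokenizer and a lexicon given as a set of lowercase words. The function name `window_cooccurrences` is hypothetical, and `window` here means the number of tokens considered on each side of the identity word.

```python
import re
from collections import Counter

def window_cooccurrences(text, identity_word, lexicon, window=2):
    """Count lexicon words within `window` tokens of each identity_word occurrence."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != identity_word:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in lexicon:
                counts[tokens[j]] += 1
    return counts

text = "The angry woman left. A happy woman and an angry man stayed."
narrow = window_cooccurrences(text, "woman", {"angry", "happy"}, window=2)
wide = window_cooccurrences(text, "woman", {"angry", "happy"}, window=3)
```

Widening the window picks up more emotion words per mention but also more words that belong to a neighbouring subject, such as the second "angry" in the toy text, which describes the man.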

# Taking a look at attention

In [None]:
# find concordances using only the huggingface datasets package

In [2]:
from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel

In [83]:
dataset_test = allthenews_dict['breitbart'].filter(lambda example: 'girl' in example['content'])#['train'].remove_columns(['id'])
model_checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

Loading cached processed dataset at /home/alina/.cache/huggingface/datasets/csv/default-1fabe3ad6ed75c39/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff/cache-5a17f8f896ab12fa.arrow


In [97]:
def tokenize_function(examples):
    return tokenizer(examples["content"])#, return_tensors='pt')

tokenized_test = dataset_test.map(tokenize_function, batched=True, num_proc=4, remove_columns=['id', 'content'])

 #0:   0%|          | 0/1 [00:00<?, ?ba/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (1418 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (634 > 512). Running this sequence through the model will result in indexing errors
 #0: 100%|██████████| 1/1 [00:01<00:00,  1.06s/ba]


 #3: 100%|██████████| 1/1 [00:01<00:00,  1.11s/ba]
Token indices sequence length is longer than the specified maximum sequence length for this model (1583 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2644 > 512). Running this sequence through the model will result in indexing errors
 #1: 100%|██████████| 1/1 [00:01<00:00,  1.23s/ba]

 #2: 100%|██████████| 1/1 [00:01<00:00,  1.25s/ba]


In [86]:
tokenized_test

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'token_type_ids'],
        num_rows: 1165
    })
})

In [87]:
# block_size = tokenizer.model_max_length # 512
block_size = 128
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
        # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
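
To see what the grouping does, here is the same logic applied to a toy batch with a block size of 4. This is a sketch: `group_texts_toy` takes `block_size` as an argument only so the example is self-contained, and the token ids are made up.

```python
def group_texts_toy(examples, block_size):
    # Same steps as group_texts, with block_size passed explicitly.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# Three "articles" of 4, 3 and 4 tokens: 11 tokens in total.
toy_batch = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 102]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1]],
}
chunks = group_texts_toy(toy_batch, block_size=4)
```

Two things follow from this: chunks can straddle article boundaries (the second chunk ends with the first token of the third article), and the last three tokens are discarded, so a reference word that falls in the remainder is lost.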

In [88]:
snippets_test = tokenized_test.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

 #0:   0%|          | 0/1 [00:00<?, ?ba/s]

 #1: 100%|██████████| 1/1 [00:02<00:00,  2.10s/ba]
 #0: 100%|██████████| 1/1 [00:02<00:00,  2.74s/ba]


 #3: 100%|██████████| 1/1 [00:02<00:00,  3.00s/ba]

 #2: 100%|██████████| 1/1 [00:03<00:00,  3.09s/ba]


In [90]:
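# 2611 is hard-coded; it is presumably the bert-base-uncased vocabulary id of 'girl',
# i.e. the value of tokenizer.convert_tokens_to_ids('girl')
girl_snippets = snippets_test['train'].filter(lambda example: 2611 in example['input_ids'])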

100%|██████████| 11/11 [00:01<00:00,  7.35ba/s]


In [91]:
girl_snippets

Dataset({
    features: ['attention_mask', 'input_ids', 'labels', 'token_type_ids'],
    num_rows: 960
})

In [103]:
import torch

In [104]:
girl_snippets_ids = torch.tensor(girl_snippets['input_ids'])
girl_snippets_token_type_ids = torch.tensor(girl_snippets['token_type_ids'])

In [67]:
# # Load model and retrieve attention
# model_version = 'bert-base-uncased'
# do_lower_case = True
# model = BertModel.from_pretrained(model_version, output_attentions=True)
# tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=do_lower_case)
# sentence_a = "The cat sat on the mat"
# sentence_b = "The cat lay on the rug"
# inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt', add_special_tokens=True)
# token_type_ids = inputs['token_type_ids']
# input_ids = inputs['input_ids']
# attention = model(input_ids, token_type_ids=token_type_ids)[-1]


# #input_id_list = input_ids[0].tolist() # Batch index 0
# #tokens = tokenizer.convert_ids_to_tokens(input_id_list)

In [21]:
del tokenized_news
del allthenews_dict, allthenews, allthenews_trainval, allthebreitbart, allthefox, alltheguardian

In [70]:
# use bert-base first
model_version = 'bert-base-uncased'
do_lower_case = True
model_bert_base = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer_bert_base = BertTokenizer.from_pretrained(model_version, do_lower_case=do_lower_case)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [23]:
# def concordance_to_attention(concordances, model, tokenizer):
#     tokenized_concordances = tokenizer(concordances, padding=True, return_tensors='pt')
#     token_type_ids = tokenized_concordances['token_type_ids']
#     input_ids = tokenized_concordances['input_ids']
#     attentions = model(input_ids, token_type_ids=token_type_ids)[-1]
#     return attentions

In [71]:
def group_tokens_to_ids(tokens, tokenizer):
    ''' return dict {token : id} if tokens is a list of tokens
    or a dict {category : {token : id} } if tokens is a dict of type {category : [tokens]}.
    tokens that are not recognised, i.e. mapped to the id of the unknown token
    (100 for bert-base-uncased), are left out in both cases'''
    unk_id = tokenizer.unk_token_id

    def to_ids(words):
        # look up all ids in one call and drop the unknown ones
        return {token: idx for token, idx in zip(words, tokenizer.convert_tokens_to_ids(words)) if idx != unk_id}

    if isinstance(tokens, dict):
        return {cat: to_ids(words) for cat, words in tokens.items()}
    return to_ids(list(tokens))

In [47]:
# female_terms_ids = group_tokens_to_ids(['girl', 'woman', 'she'], tokenizer_bert_base)
# emotion_terms_ids = group_tokens_to_ids(emotionwords_dict, tokenizer_bert_base)

In [33]:
# def attention_weighted_counts(text_snippets, tokenizer, attentions, identity_word, emotion_tokens, layer_id, attention_head_nr):
#     '''
#     text_snippet -- list of concordances centered around identity word
#     attention -- attention output by passing text_snippet to some model
#     identity_word -- some word that comes up in every text snippet that is used as point of reference. ie use attention from this word to the other words for the weighting
#     '''
#     # get id of identity_word that serves as reference point
#     ref_wd_id = tokenizer.convert_tokens_to_ids(identity_word)
#     # number of times identity word occurs. should be length of text_snippets
#     N = len(text_snippets)
#     # get list of lists with tuples (word_idx, attention from reference word to this word)
#     weighted_cts_test = [[(idx.item(), attention.item()/N) for idx, attention in zip(text_snippets[sentence_id], attentions[layer_id][sentence_id, attention_head_nr, text_snippets[sentence_id].tolist().index(ref_wd_id),:])] for sentence_id in range(text_snippets.shape[0])]
#     # flatten list
#     weighted_cts_test = [item for sublist in weighted_cts_test for item in sublist]
#     # emotion tokens to ids
#     emotion_ids = group_tokens_to_ids(emotion_tokens, tokenizer)
#     cts = {emotion: sum([sum([ct[1] for ct in weighted_cts_test if ct[0] == emotion_term_id]) for emotion_term, emotion_term_id in emotion_ids[emotion].items()]) for emotion in emotions}
#     return cts

In [1]:
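def attention_weighted_counts(newspaper: str, reference_word: str, model_path, layer_id: int, attention_head_id: int, input_dir: str = '../data/external/archive'):
    '''
    Sum the attention that reference_word pays to the words of each emotion category.

    newspaper : str -- publication name as it appears in the data, e.g. 'Breitbart'
    reference_word : str -- identity word used as the point of reference, e.g. 'girl'
    model_path : str -- model name or path passed to from_pretrained
    layer_id : int -- index of the layer whose attention is used
    attention_head_id : int -- index of the attention head within that layer
    input_dir : str -- directory containing articles1.csv, articles2.csv and articles3.csv

    Example usage:
    attention_weighted_counts('Breitbart', 'girl', 'bert-base-uncased', 0, 11)

    '''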

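    # output_attentions=True is needed so that the last model output holds the attention weights
    model = AutoModel.from_pretrained(model_path, output_attentions=True)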
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    #tokenized_concordances = tokenizer(concordances, padding=True, return_tensors='pt')
    #token_type_ids = tokenized_concordances['token_type_ids']
    #input_ids = tokenized_concordances['input_ids']

    # load emotion words from file
    emotionwords = pd.read_excel('../data/external/NRC-Emotion-Lexicon-v0.92-In105Languages-Nov2017Translations.xlsx', usecols="A,DB:DK")
    emotions = list(emotionwords.columns[1:])
    # check if this works
    emotion_tokens = {emotion: list(emotionwords['English (en)'].loc[emotionwords[emotion]==1]) for emotion in emotions}


    # load news paper corpora
    # load entire dataset, discard columns we don't need
    news = load_dataset('csv', script_version='master', data_files=[input_dir+'/articles1.csv', input_dir+'/articles2.csv', input_dir+'/articles3.csv'], 
                          column_names = ['Unnamed', 'id', 'title', 'publication', 'author', 'date', 'year', 'month', 'url', 'content']).remove_columns(['Unnamed', 'title', 'author', 'year', 'month', 'url', 'date'])
    
    # filter for publication of interest
    news_newspaper = news.filter(lambda example: example['publication']==newspaper).remove_columns(['publication'])
    
    # find articles that contain the word
    articles_w_ref_wd = news_newspaper.filter(lambda example: reference_word in example['content'])

    # tokenize
    articles_tokenized = articles_w_ref_wd.map(tokenize_function, batched=True, num_proc=4, remove_columns=['id', 'content'])

    # split data into processable chunks
    snippets = articles_tokenized.map(
            group_texts,
            batched=True,
            batch_size=1000,
            num_proc=4,
        )

    # find vocabulary id of reference word
    idx = tokenizer.convert_tokens_to_ids(reference_word)

    # discard snippets that don't have the reference word
    snippets_wd = snippets['train'].filter(lambda example: idx in example['input_ids'])

    # cast to tensor
    input_ids = torch.tensor(snippets_wd['input_ids'])
    token_type_ids = torch.tensor(snippets_wd['token_type_ids'])
    
    # pass the chunks to the model to get attention (no gradients needed for inference)
    with torch.no_grad():
        attentions = model(input_ids, token_type_ids=token_type_ids)[-1]
    print('--calculated attentions--')
    #cts = attention_weighted_counts(tokenized_concordances['input_ids'], tokenizer, attentions, reference_word, emotion_tokens, layer_id, attention_head_id)
    #print(attentions.shape)
    
    # get id of identity_word that serves as reference point
    ref_wd_id = tokenizer.convert_tokens_to_ids(reference_word)
    print('Reference word id: ', ref_wd_id)
    # number of snippets; each snippet contributes one occurrence of the identity word
    N = len(input_ids)

    # get list of lists with tuples (word_idx, attention from reference word to this word)

    w_cts = [[(idx.item(), attention.item()/N) for idx, attention in zip(input_ids[sentence_id], attentions[layer_id][sentence_id, attention_head_id, input_ids[sentence_id].tolist().index(ref_wd_id),:])] for sentence_id in range(input_ids.shape[0])]
    print('--collected counts--')
    # flatten list
    w_cts = [item for sublist in w_cts for item in sublist]
    
    # emotion tokens to ids
    emotion_ids = group_tokens_to_ids(emotion_tokens, tokenizer)
    
    # sum over words in each emotion category
#     sum_w_cts = {}
#     for emotion in emotions.keys():
#             for token, idx in emotion_ids[emotion]:
#                 sum([ct[1] for ct in w_cts if ct[0] == emotion_term_id])
                
                
#         sum([sum([ct[1] for ct in w_cts if ct[0] == emotion_term_id]) for emotion_term, emotion_term_id in emotion_ids[emotion].items()])
        
        
    sum_w_cts = {emotion: sum([sum([ct[1] for ct in w_cts if ct[0] == emotion_term_id]) for emotion_term, emotion_term_id in emotion_ids[emotion].items()]) for emotion in emotions}
    
    return sum_w_cts
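
The weighting step is easier to follow on toy numbers. The sketch below mirrors it in plain Python for a single layer and head, using nested lists instead of tensors; the name `toy_attention_weighted_counts`, the token ids and the attention values are all made up for illustration.

```python
def toy_attention_weighted_counts(input_ids, attention, ref_id, emotion_ids):
    """input_ids: list of token-id lists, one per snippet.
    attention[s][i][j]: weight from position i to position j in snippet s
    (one layer, one head). emotion_ids: {emotion: set of token ids}."""
    n = len(input_ids)
    totals = {emotion: 0.0 for emotion in emotion_ids}
    for ids, att in zip(input_ids, attention):
        # Only the first occurrence of the reference word in a snippet is used.
        ref_pos = ids.index(ref_id)
        for token_id, weight in zip(ids, att[ref_pos]):
            for emotion, id_set in emotion_ids.items():
                if token_id in id_set:
                    totals[emotion] += weight / n
    return totals

# Two snippets of three tokens each; token 1 is the reference word.
ids = [[1, 2, 3], [3, 1, 4]]
att = [
    [[0.2, 0.3, 0.5], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
    [[0.6, 0.2, 0.2], [0.5, 0.25, 0.25], [0.1, 0.1, 0.8]],
]
totals = toy_attention_weighted_counts(
    ids, att, ref_id=1, emotion_ids={"anger": {3}, "joy": {4}}
)
```

Each total is the average, over snippets, of the attention mass that the reference word places on tokens of that emotion category, so values from corpora of different sizes stay comparable.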

In [107]:
cts = attention_weighted_counts(girl_snippets_ids, girl_snippets_token_type_ids, 'girl', 'bert-base-uncased', emotionwords_dict, 11, 0)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.se

<class 'torch.Tensor'>



In [54]:
#  For each layer add all attention heads
#  Then have a dataframe with rows the emotion categories and columns layers 0, ..., 12, total
# w_cts_fox_girl_0_0 = attention_weighted_counts(concordances_fox_girl, 'girl', model_bert_base, tokenizer_bert_base, emotionwords_dict, 0, 0)
# w_cts_fox_boy_0_0 = attention_weighted_counts(concordances_fox_boy, 'boy', model_bert_base, tokenizer_bert_base, emotionwords_dict, 0, 0)
# w_cts_fox_woman = attention_weighted_counts(concordances_fox_woman, 'woman', model_bert_base, tokenizer_bert_base, emotionwords_dict, 0, 0)
# w_cts_fox_man = attention_weighted_counts(concordances_fox_man, 'man', model_bert_base, tokenizer_bert_base, emotionwords_dict, 0, 0)
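
One possible shape for the planned aggregation, sketched in plain Python: given per-head results keyed by `(layer, head)`, sum over the heads of each layer and keep a running total. The function name `sum_heads_per_layer` and the toy values are assumptions for illustration only.

```python
def sum_heads_per_layer(counts_by_layer_head):
    """counts_by_layer_head: {(layer, head): {emotion: value}}.
    Returns {emotion: {layer: summed value, ..., 'total': overall sum}}."""
    table = {}
    for (layer, head), counts in counts_by_layer_head.items():
        for emotion, value in counts.items():
            row = table.setdefault(emotion, {"total": 0.0})
            row[layer] = row.get(layer, 0.0) + value
            row["total"] += value
    return table

# Toy results for two heads in layer 0 and one head in layer 1.
toy_counts = {
    (0, 0): {"Anger": 0.5, "Joy": 0.25},
    (0, 1): {"Anger": 0.25, "Joy": 0.25},
    (1, 0): {"Anger": 1.0, "Joy": 0.5},
}
table = sum_heads_per_layer(toy_counts)
```

Passing the result to `pd.DataFrame(table).T` should give the layout described above, with emotion categories as rows and one column per layer plus a total.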

In [70]:
w_cts_fox_girl_11_0 = attention_weighted_counts(concordances_fox_girl, 'girl', model_bert_base, tokenizer_bert_base, emotionwords_dict, 11, 0)
w_cts_fox_boy_11_0 = attention_weighted_counts(concordances_fox_boy, 'boy', model_bert_base, tokenizer_bert_base, emotionwords_dict, 11, 0)

--calculated attentions--
--collected counts--
--calculated attentions--
--collected counts--


In [52]:
round(pd.DataFrame([w_cts_fox_girl_11_0, w_cts_fox_boy_11_0], index=['girl', 'boy']).T*10e3, 1)

| Emotion | girl | boy |
|---|---|---|
| Positive | 19.0 | 19.5 |
| Negative | 16.5 | 50.3 |
| Anger | 8.3 | 10.4 |
| Anticipation | 11.0 | 9.5 |
| Disgust | 2.9 | 34.8 |
| Fear | 11.8 | 14.0 |
| Joy | 11.3 | 5.8 |
| Sadness | 9.9 | 9.7 |
| Surprise | 6.6 | 3.9 |
| Trust | 14.4 | 17.5 |


In [44]:
# should I delete stuff in between so it stops crashing
w_cts_guardian_girl = attention_weighted_counts(concordances_guardian_girl, 'girl', model_bert_base, tokenizer_bert_base, emotionwords_dict, 0, 0)
w_cts_guardian_boy = attention_weighted_counts(concordances_guardian_boy, 'boy', model_bert_base, tokenizer_bert_base, emotionwords_dict, 0, 0)
w_cts_guardian_woman = pd.DataFrame.from_dict(attention_weighted_counts(concordances_guardian_woman, 'woman', model_bert_base, tokenizer_bert_base, emotionwords_dict, 0, 0))
# w_cts_guardian_man = pd.DataFrame.from_dict(attention_weighted_counts(concordances_guardian_man, 'man', model_bert_base, tokenizer_bert_base, emotionwords_dict, 0, 0)

--calculated attentions--
--collected counts--


KeyboardInterrupt: 

In [43]:
pd.DataFrame([w_cts_guardian_girl, w_cts_guardian_boy], index=['girl', 'boy']).T.round(4)

| Emotion | girl | boy |
|---|---|---|
| Positive | 0.0219 | 0.0213 |
| Negative | 0.0156 | 0.0293 |
| Anger | 0.0071 | 0.0094 |
| Anticipation | 0.0105 | 0.01 |
| Disgust | 0.0049 | 0.0183 |
| Fear | 0.0096 | 0.01 |
| Joy | 0.0119 | 0.0102 |
| Sadness | 0.0079 | 0.0096 |
| Surprise | 0.0059 | 0.0077 |
| Trust | 0.0117 | 0.0129 |


In [24]:
w_cts_guardian_girl_11_0 = attention_weighted_counts(concordances_guardian_girl, 'girl', model_bert_base, tokenizer_bert_base, emotionwords_dict, 11, 0)
#w_cts_guardian_boy_11_0 = attention_weighted_counts(concordances_guardian_boy, 'boy', model_bert_base, tokenizer_bert_base, emotionwords_dict, 11, 0)

In [None]:
# round(pd.DataFrame([w_cts_guardian_girl_11_0, w_cts_guardian_boy_11_0], index=['girl', 'boy']).T*10e3, 2)
round(pd.DataFrame([w_cts_guardian_girl_11_0], index=['girl']).T*10e3, 2)

Error: Session cannot generate requests

In [None]:
print('Fox, girl: ', w_cts_fox_girl.round(4))
print('Fox, boy: ', w_cts_fox_boy.round(4))
print('Fox, man: ', w_cts_fox_man.round(4))
print('Fox, woman: ', w_cts_fox_woman.round(4))

In [None]:
print('Guardian, girl: ', w_cts_guardian_girl.round(4))
print('Guardian, boy: ', w_cts_guardian_boy.round(4))
print('Guardian, man: ', w_cts_guardian_man.round(4))
print('Guardian, woman: ', w_cts_guardian_woman.round(4))

## Load fine tuned models

In [2]:
from transformers import BertModel, BertTokenizer

In [86]:
tokenizer_bert_guardian = BertTokenizer.from_pretrained('../../stereotypes_in_lms/bert_guardian/')
model_bert_guardian = BertModel.from_pretrained('../../stereotypes_in_lms/bert_guardian/')
tokenizer_bert_breitbart = BertTokenizer.from_pretrained('../../stereotypes_in_lms/bert_breitbart/')
model_bert_breitbart = BertModel.from_pretrained('../../stereotypes_in_lms/bert_breitbart/')
tokenizer_bert_reuters = BertTokenizer.from_pretrained('../../stereotypes_in_lms/bert_reuters/')
model_bert_reuters = BertModel.from_pretrained('../../stereotypes_in_lms/bert_reuters/')
tokenizer_bert_fox = BertTokenizer.from_pretrained('../../stereotypes_in_lms/bert_fox/')
model_bert_fox = BertModel.from_pretrained('../../stereotypes_in_lms/bert_fox/')

Some weights of the model checkpoint at ../../stereotypes_in_lms/bert_guardian/ were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at ../../stereotypes_in_lms/bert_guardian/ and are newly initialized: ['bert.pooler.dens

In [None]:
attention_weighted_counts(concordances_fox_girl, 'girl', model_bert_fox, tokenizer_bert_fox, emotionwords_dict, 0, 0)

In [6]:
# visualise weights on some examples
# https://colab.research.google.com/drive/1YoJqS9cPGu3HL2_XExw3kCsRBtySQS2v?usp=sharing