# Feature engineering

This script implements the features proposed in

Spinde, T.; Rudnitckaia, L.; Mitrović, J.; Hamborg, F.; Granitzer, M.; Gipp, B. & Donnay, K.
Automated identification of bias inducing words in news articles using linguistic and context-oriented features 
Information Processing & Management, Elsevier, 2021, 58, 102505

In this script, we:
- pre-process the sentences,
- create features,
- prepare data for further ML training (e.g., one-hot encoding, etc.)

Dictionaries needed to run the code (only the bias lexicon is attached):
- LIWC2015 dictionary (J. W. Pennebaker, R. L. Boyd, K. Jordan and K. Blackburn, "The Development and Psychometric Properties of LIWC2015," 2015.)
- sentiment and subjectivity lexicon (T. Wilson, J. Wiebe and P. Hoffmann, "Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis," in HLT/EMNLP, 2005. https://mpqa.cs.pitt.edu/lexicons/subj_lexicon/)
- sentiment lexicon (M. Hu and B. Liu, "Mining and Summarizing Customer Reviews," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Seattle, Washington, 2004. https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html)
- assertive_verbs (Hooper, J. B. (1974). On assertive predicates. Bloomington: Indiana University Linguistics Club.)
- factive_verbs (Hooper, J. B. (1974). On assertive predicates. Bloomington: Indiana University Linguistics Club.)
- implicative_verbs (Karttunen, L. (1971). Implicative Verbs. Language, 47(2), 340-358. doi:10.2307/412084)
- report_verbs (https://en.wiktionary.org/wiki/Category:English_reporting_verbs, https://www.ef.com/in/english-resources/english-grammar/reporting-verbs/, https://www.adelaide.edu.au/writingcentre/sites/default/files/docs/learningguide-verbsforreporting.pdf)
- hyperbolic terms (A. Chakraborty, B. Paranjape, S. Kakar and N. Ganguly, "Stop Clickbait: Detecting and preventing clickbaits in online news media," 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 9-16, 2016.)
- attitude markers (Hyland, K. (2005). Metadiscourse: Exploring Interaction in Writing.)
- boosters (Hyland, K. (2005). Metadiscourse: Exploring Interaction in Writing.)
- hedges (Hyland, K. (2005). Metadiscourse: Exploring Interaction in Writing.)
- kill verbs (Greene, S., & Resnik, P. (2009). More than Words: Syntactic Packaging and Implicit Sentiment. HLT-NAACL.)
- MRCP_concretness_ratings (MRC database: https://websites.psychology.uwa.edu.au/school/MRCDatabase/uwa_mrc.htm)
- MRCP_Imagability_ratings (MRC database: https://websites.psychology.uwa.edu.au/school/MRCDatabase/uwa_mrc.htm)
- bias_word_lexicon.xlsx (the attached bias lexicon)

In [1]:
# data
import pandas as pd
import numpy as np
import csv

# misc
import os
import re
import time
import ast
import warnings
import math
import copy

# nlp
import spacy
import en_core_web_sm
# note: despite the variable name, the small English model (en_core_web_sm) is loaded here
nlp_spacy_core_web_lg = en_core_web_sm.load()
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

In [2]:
from google.colab import drive
drive.mount('/content/drive')
os.chdir('drive/MyDrive/media-bias-detection')

#dt = pd.read_excel('data/final_labels_SG1.xlsx')
dt = pd.read_excel('data/final_labels_SG2.xlsx')
dt.rename(columns={'text': 'sentence', 'biased_words': 'biased_words2'}, inplace=True)
dt = dt[dt['biased_words2'].notna()]
dt.reset_index(inplace=True)
dt = dt[['sentence',
         'outlet',
         'topic',
         'type',
         'biased_words2']]
dt["biased_words2"] = dt.biased_words2.apply(lambda s: list(ast.literal_eval(s)))

dt.head(3)

Mounted at /content/drive


Unnamed: 0,sentence,outlet,topic,type,biased_words2
0,"""Orange Is the New Black"" star Yael Stone is r...",Fox News,environment,right,[]
1,"""We have one beautiful law,"" Trump recently sa...",Alternet,gun control,left,"[bizarre, characteristically]"
2,"...immigrants as criminals and eugenics, all o...",MSNBC,white-nationalism,left,"[criminals, fringe, extreme]"
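The `biased_words` column is stored as the string form of a Python list, which is why the cell above passes it through `ast.literal_eval`. A minimal sketch of that parsing step on made-up values (the helper name `parse_word_list` is hypothetical and not part of the notebook):

```python
import ast

def parse_word_list(raw):
    """Turn a string such as "['a', 'b']" into a real list of words."""
    return list(ast.literal_eval(raw))

# made-up cell values in the same shape as the column shown above
parsed_empty = parse_word_list("[]")
parsed_words = parse_word_list("['bizarre', 'characteristically']")
print(parsed_empty, parsed_words)
```

`ast.literal_eval` only accepts Python literals, so a malformed cell raises an error instead of executing arbitrary code.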


## 1 Tokenization, POS tagging, lemmatization, syntactic dependencies, NER

Info on SpaCy objects:
- Text: The original word text.
- Lemma: The base form of the word.
- POS: The simple UPOS part-of-speech tag.
- Tag: The detailed part-of-speech tag.
- Dep: Syntactic dependency, i.e. the relation between tokens.
- Shape: The word shape – capitalization, punctuation, digits.
- is alpha: Does the token consist only of alphabetic characters?
- is stop: Is the token part of a stop list, i.e. the most common words of the language?

In [3]:
# run just once - adjust pipeline (spaCy v2 API: a plain function is passed to add_pipe)
# deactivate sentence splitting: each input is always a single sentence, so there is no need to introduce extra errors
def custom_sentencizer(doc):
    # mark every token after the first one (including the last) as not starting a sentence
    for token in doc[1:]:
        token.is_sent_start = False
    return doc

nlp_spacy_core_web_lg.add_pipe(custom_sentencizer, before="tagger")

In [4]:
# tokenization 
dt["spacy_lg"] = dt["sentence"].apply(lambda x: nlp_spacy_core_web_lg(x))
dt["spacy_lg_dict"] = None # to be filled in the next section

## 2 TF-IDF, dictionaries, bias lexicon

### 2.1 Create the TF-IDF matrix, load dictionaries, and assign TF-IDF and dictionary labels to each word (except for dictionaries with multi-token entries)

In [5]:
# calculate tf-idf matrix for corpus of articles
dt = dt.replace({np.nan: None})
# len(data_1sent[data_1sent['article'].isna()]) # mostly articles from usa today are lost
# for the articles that couldn't be scraped, use sentence 
dt['article'] = dt['sentence']
# de-duplicate without inplace=True, so that no slice of dt is modified in place
corpus_art = dt[['sentence','article']].drop_duplicates(subset = ["article"], keep = 'first')
corpus_art = corpus_art.reset_index()
corpus_art = corpus_art.rename(columns={"sentence": "sentence_", "article": "article_", "index": "index_"})

corpus_art_list = list(corpus_art['article_'])


tfidf = TfidfVectorizer(token_pattern = r"(?u)\b\w+\b")
x_art = tfidf.fit_transform(corpus_art_list)

# on scikit-learn >= 1.0, get_feature_names_out() replaces get_feature_names()
df_tfidf_art = pd.DataFrame(x_art.toarray(), columns=tfidf.get_feature_names())

corpus_art = pd.merge(corpus_art, df_tfidf_art, left_index=True, right_index=True, how='left')
df_tfidf_art_updated = pd.merge(dt['article'], corpus_art, left_on='article',
                                right_on='article_', how='left')

df_tfidf_art_updated.head(3)


Unnamed: 0,article_x,index_,sentence_,article_,0,000,000m,03,1,10,100,1000,102,103,107,11,110,113,115,12,120,124,12th,13,130,134,135,14,15,150,157,15th,16,1600,1600s,1619,1639,17,171,175,...,yoga,york,yorkers,you,yougov,young,younger,youngest,youngsters,your,youth,youthful,youtube,yuan,z,zandi,zeal,zealand,zealot,zealotry,zeitung,zelaya,zero,zeroed,zerohedge,zhong,zillionth,zimmerman,zip,zirin,zirun,zoe,zohra,zone,zones,zoom,zoos,zubayer,zuckerberg,über
0,"""Orange Is the New Black"" star Yael Stone is r...",0,"""Orange Is the New Black"" star Yael Stone is r...","""Orange Is the New Black"" star Yael Stone is r...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"""We have one beautiful law,"" Trump recently sa...",1,"""We have one beautiful law,"" Trump recently sa...","""We have one beautiful law,"" Trump recently sa...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"...immigrants as criminals and eugenics, all o...",2,"...immigrants as criminals and eugenics, all o...","...immigrants as criminals and eugenics, all o...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
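To make the numbers in this matrix easier to interpret: with its default settings, `TfidfVectorizer` lower-cases the text, multiplies each term count by a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, and then L2-normalises every row. The following is a minimal pure-Python sketch of that weighting on a toy corpus; the function name `tfidf_rows` and the toy documents are illustrative assumptions, and the sketch is not a replacement for the scikit-learn implementation.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"(?u)\b\w+\b")  # same token pattern as in the cell above

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalised)."""
    tokenized = [TOKEN_RE.findall(doc.lower()) for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
                   for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# toy corpus, made up for illustration
toy_docs = ["a biased claim", "a neutral claim", "a report"]
toy_rows = tfidf_rows(toy_docs)
print(toy_rows[0])
```

In the toy corpus, "biased" occurs in one document only and therefore receives a higher weight than "claim" (two documents) or "a" (all three).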


In [6]:
# load the LIWC2015 dictionary and codes
liwc2015 = pd.read_excel("data/bias_related_lexicons/LIWC2015.xlsx", sheet_name='lexica')
#                        converters={'codes':ast.literal_eval})
liwc2015_codes = pd.read_excel("data/bias_related_lexicons/LIWC2015.xlsx", sheet_name='psychological_codes')
liwc2015['codes_list'] = liwc2015[['c1', 'c2', 'c3', 'c4', 'c5',
                                   'c6', 'c7', 'c8', 'c9', 'c10']].values.tolist()
liwc2015['codes_list_clean'] = liwc2015.apply(lambda row: [int(x) for x in row['codes_list'] if not math.isnan(x)],
                                              axis=1)
liwc2015 = liwc2015[['word','codes_list_clean']]
liwc2015 = liwc2015.rename(columns={"codes_list_clean": "codes"})
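The cleaning step above collects the ten code columns of each word into one list and keeps only the entries that are not NaN, cast to integers; presumably the unused columns are empty in the spreadsheet and are therefore read as NaN. A minimal sketch on a made-up row (the helper name `clean_codes` and the code values are assumptions for illustration):

```python
import math

def clean_codes(raw_codes):
    """Drop NaN padding and cast the remaining category codes to int."""
    return [int(code) for code in raw_codes if not math.isnan(code)]

# made-up row: two category codes followed by empty (NaN) columns
toy_row = [1.0, 31.0, float("nan"), float("nan")]
cleaned = clean_codes(toy_row)
print(cleaned)
```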

In [7]:
# sentiment and subjectivity lexicons
subjectivity_lexicon = pd.read_excel("data/bias_related_lexicons/subjectivity_lexicon.xlsx")
weak_subj = list(set(subjectivity_lexicon[subjectivity_lexicon['type']=='weaksubj']['word']))
strong_subj = list(set(subjectivity_lexicon[subjectivity_lexicon['type']=='strongsubj']['word']))
negative_wilson = list(set(subjectivity_lexicon[subjectivity_lexicon['priorpolarity']=='negative']['word']))
positive_wilson = list(set(subjectivity_lexicon[subjectivity_lexicon['priorpolarity']=='positive']['word']))

negative_bing = list(pd.read_excel("data/bias_related_lexicons/opinion_lexicon_english/negative-words.xlsx", header=None)[0])
positive_bing = list(pd.read_excel("data/bias_related_lexicons/opinion_lexicon_english/positive-words.xlsx", header=None)[0])

# concatenation of sentiment lexicons
positive_conc = list(set(positive_wilson + positive_bing))
negative_conc = list(set(negative_wilson + negative_bing))

print('positive: bing:',len(positive_bing),'wilson:',len(positive_wilson),'concatenated:',len(positive_conc))
print('negative: bing:',len(negative_bing),'wilson:',len(negative_wilson),'concatenated:',len(negative_conc))

both_conc = [value for value in negative_conc if value in positive_conc]
both_bing = [value for value in positive_bing if value in negative_bing]
both_wilson = [value for value in positive_wilson if value in negative_wilson]
print('overlap between concatenated positive and negative lists:', len(both_conc))
print("overlap between bing's positive and negative lists:", len(both_bing))
print("overlap between wilson's positive and negative lists:", len(both_wilson))
# what to do with overlap? maybe annotate them manually? or according to sentiwordnet (code in old.ipynb)

positive: bing: 2005 wilson: 2304 concatenated: 2757
negative: bing: 4783 wilson: 4154 concatenated: 5112
overlap between concatenated positive and negative lists: 36
overlap between bing's positive and negative lists: 3
overlap between wilson's positive and negative lists: 6
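The overlap counts above can also be obtained with set intersections, which avoids scanning one list for every element of the other. The sketch below uses made-up word lists; dropping the ambiguous words from both lists is only one possible policy, shown here as an assumption and not as the choice made in this notebook.

```python
# made-up toy lexicons; the real lists come from the files loaded above
positive_toy = ["good", "fine", "envy", "good"]
negative_toy = ["bad", "envy", "awful"]

positive_set, negative_set = set(positive_toy), set(negative_toy)

# words that appear in both polarity lists
ambiguous = positive_set & negative_set

# one possible policy (an assumption): exclude ambiguous words from both lists
positive_clean = positive_set - ambiguous
negative_clean = negative_set - ambiguous
print(sorted(ambiguous), sorted(positive_clean), sorted(negative_clean))
```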


In [8]:
# other lexical features
assertive_verbs = list(pd.read_excel("data/bias_related_lexicons/assertive_verbs.xlsx", header=None)[0])
factive_verbs = list(pd.read_excel("data/bias_related_lexicons/factive_verbs.xlsx", header=None)[0])
implicative_verbs = list(pd.read_csv("data/bias_related_lexicons/implicative_verbs.csv", header=None)[1])
report_verbs1 = list(pd.read_excel("data/bias_related_lexicons/report_verbs.xlsx", header=None, sheet_name = 'wiki')[0])
report_verbs2 = list(pd.read_excel("data/bias_related_lexicons/report_verbs.xlsx", header=None, sheet_name = 'ef')[0])
report_verbs3 = list(pd.read_excel("data/bias_related_lexicons/report_verbs.xlsx", header=None, sheet_name = 'adelaide')[0])
report_verbs = list(set(report_verbs1+report_verbs2+report_verbs3))
hyperbolic_terms = list(pd.read_csv("data/bias_related_lexicons/hyperbolic_terms.csv", header=None)[0])
attitude_markers = list(pd.read_excel("data/bias_related_lexicons/attitude_markers.xlsx", header=None)[0])
boosters = list(pd.read_excel("data/bias_related_lexicons/boosters.xlsx", header=None)[0])
hedges = list(pd.read_excel("data/bias_related_lexicons/hedges.xlsx", header=None)[0])
kill_verbs = list(pd.read_excel("data/bias_related_lexicons/kill_verbs_nominalizations.xlsx", header=None)[0])

# MRCP ratings
MRCP_concretness_ratings = pd.read_excel("data/bias_related_lexicons/MRCP_concretness_ratings.xlsx")
MRCP_concretness_ratings = MRCP_concretness_ratings.drop_duplicates(subset=['WORD'])
MRCP_concretness_ratings["WORD"] = MRCP_concretness_ratings["WORD"].apply(lambda x: x.lower())
MRCP_concretness_ratings = MRCP_concretness_ratings.reset_index()
MRCP_concretness_ratings = MRCP_concretness_ratings[["WORD", "CNC"]]

MRCP_Imagability_ratings = pd.read_excel("data/bias_related_lexicons/MRCP_Imagability_ratings.xlsx")
MRCP_Imagability_ratings = MRCP_Imagability_ratings.drop_duplicates(subset=['WORD'])
MRCP_Imagability_ratings["WORD"] = MRCP_Imagability_ratings["WORD"].apply(lambda x: x.lower())
MRCP_Imagability_ratings = MRCP_Imagability_ratings.reset_index()
MRCP_Imagability_ratings = MRCP_Imagability_ratings[["WORD", "IMG"]]


#bias_lexicon = list(pd.read_excel("bias_word_lexicon_top100_10times.xlsx", header=None, 
#                                  sheet_name='top100_10times')[0])
bias_lexicon = list(pd.read_excel("data/bias_related_lexicons/bias_word_lexicon.xlsx", header=None)[0])
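The lexicons above are loaded as Python lists, and the feature loop further down checks each token's lower-cased text and lemma against them one lexicon at a time. Membership tests on a list scan the whole list, so converting the lexicons to sets (and the rating tables to dictionaries) is one way to speed this up. The sketch below shows the idea with made-up data; `in_lexicon` and `lookup_rating` are hypothetical helpers, not functions from the notebook.

```python
def in_lexicon(text, lemma, lexicon):
    """Return 1 if the lower-cased token text or lemma is in the lexicon, else 0."""
    return int(text.lower() in lexicon or lemma.lower() in lexicon)

def lookup_rating(text, lemma, ratings):
    """Return the rating of the token text, falling back to the lemma, else None."""
    for key in (text.lower(), lemma.lower()):
        if key in ratings:
            return ratings[key]
    return None

# made-up toy data; the rating values are not taken from the MRC database
toy_hedges = {"may", "perhaps"}
toy_ratings = {"dog": 610, "idea": 300}

flag_hit = in_lexicon("Perhaps", "perhaps", toy_hedges)
flag_miss = in_lexicon("Dogs", "dog", toy_hedges)
rating_via_lemma = lookup_rating("Dogs", "dog", toy_ratings)
rating_missing = lookup_rating("xyz", "xyz", toy_ratings)
print(flag_hit, flag_miss, rating_via_lemma, rating_missing)
```

With the real data, a set could be built once per lexicon, for example `set(hedges)`, before the loop starts.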

In [9]:
# keep only sentences that contain more than one spaCy token
dt = dt.loc[dt['spacy_lg'].map(len) > 1,]

In [10]:
# features names
feats = ['text', 'text_low', 'pos', 'lemma', 'lemma_low', 'tag', 'dep', 'shape', 'is_alpha', 'is_stop',
         'has_vec', 'glove_vec300', 'glove_vec300_norm', 'is_oov', 'order', 'is_ne', 'ne_label',
         'tfidf_art', 'liwc2015', 
         'negative_conc', 'positive_conc',
         'weak_subj', 'strong_subj',
         'MRCP_concretness_ratings', 'MRCP_Imagability_ratings',
         'hyperbolic_terms', 'attitude_markers', 'kill_verbs', 'bias_lexicon']

# 'assertive_verbs', 'factive_verbs', 'implicative_verbs', 'report_verbs', 'boosters', 'hedges',
feats_C1 = [feat + '_C1' for feat in feats]
feats_C2 = [feat + '_C2' for feat in feats]
feats_C3 = [feat + '_C3' for feat in feats]
feats_C4 = [feat + '_C4' for feat in feats]

start_time = time.time()
# iterate over each sentence
for index, row in dt.iterrows():
    #biased = row.biased_words

    sent_tokens = []
    
    # create a list of NEs in a sentence
    entities = []
    for ent in dt['spacy_lg'][index].ents:
        ent_dict = {'text': ent.text,
                    'label': ent.label_}
        #print((ent.text, ent.start_char, ent.end_char, ent.label_))
        entities.append(ent_dict)

    # add linguistic features for each token
    order = 0
    for token in dt['spacy_lg'][index]:
        if token.pos_ not in ['PUNCT', 'SPACE', 'SYM', 'CCONJ', 'PART', 'NUM']:
        #if token.pos_ != 'PUNCT' and token.pos_ != 'SPACE' and token.pos_ != 'SYM' and token.pos_ != 'CCONJ' and token.pos_ != 'PART':
            order += 1
            if token.text.lower() in list(df_tfidf_art_updated):
                tfidf_art = df_tfidf_art_updated[token.text.lower()][index]
            else:
                tfidf_art = None 
                
            if token.text in row.biased_words2:
                label2 = 1
            else:
                label2 = 0
            # if token.text in row.biased_words4:
            #     label4 = 1
            # else:
            #     label4 = 0                
            # if token.text in row.biased_words5:
            #     label5 = 1
            # else:
            #     label5 = 0
            
            token_dict={'text':token.text,
                    'text_low': token.text.lower(),
                    'pos': token.pos_,
                    'lemma': token.lemma_,
                    'lemma_low': token.lemma_.lower(),
                    'tag': token.tag_,
                    'dep': token.dep_,
                    'shape': token.shape_,
                    'is_alpha': token.is_alpha,
                    'is_stop': token.is_stop,
                    'has_vec':token.has_vector,
                    'glove_vec300':token.vector,
                    'glove_vec300_norm':token.vector_norm,
                    'is_oov':token.is_oov,
                    'order': (order - 1),
                    'tfidf_art': tfidf_art,
                    'label3': label2,
                    'label4': label2,
                    'label5': label2}
    
            # check if a token is part of an NE; if yes, set is_ne=True and store the NE type
            # (token.ent_type_ avoids substring matches such as 'in' inside 'Washington')
            token_dict['is_ne'] = bool(token.ent_type_)
            token_dict['ne_label'] = token.ent_type_ if token.ent_type_ else None

    # dictionaries with 1 token
            # LIWC2015 FEATURES
            if token.text.lower() in list(liwc2015.word):
                token_dict['liwc2015'] = list(liwc2015[liwc2015['word']==token.text.lower()]['codes'])[0]
            elif token.lemma_ in list(liwc2015.word):
                token_dict['liwc2015'] = list(liwc2015[liwc2015['word']==token.lemma_]['codes'])[0]
            else:
                token_dict['liwc2015'] = []
                
            # concatenated negative list
            if token.text.lower() in negative_conc:
                token_dict['negative_conc'] = 1
            elif token.lemma_.lower() in negative_conc:
                token_dict['negative_conc'] = 1
            else:
                token_dict['negative_conc'] = 0
                
            # concatenated positive list
            if token.text.lower() in positive_conc:
                token_dict['positive_conc'] = 1
            elif token.lemma_.lower() in positive_conc:
                token_dict['positive_conc'] = 1
            else:
                token_dict['positive_conc'] = 0
                
            # weak subjectivity list
            if token.text.lower() in weak_subj:
                token_dict['weak_subj'] = 1
            elif token.lemma_.lower() in weak_subj:
                token_dict['weak_subj'] = 1
            else:
                token_dict['weak_subj'] = 0
                
            # strong subjectivity list
            if token.text.lower() in strong_subj:
                token_dict['strong_subj'] = 1
            elif token.lemma_.lower() in strong_subj:
                token_dict['strong_subj'] = 1
            else:
                token_dict['strong_subj'] = 0

            # MRCP concretness ratings
            if token.text.lower() in list(MRCP_concretness_ratings['WORD']):
                rating = list(MRCP_concretness_ratings[MRCP_concretness_ratings['WORD']==token.text.lower()]['CNC'])[0]
                token_dict['MRCP_concretness_ratings'] = rating
            elif token.lemma_.lower() in list(MRCP_concretness_ratings['WORD']):
                rating = list(MRCP_concretness_ratings[MRCP_concretness_ratings['WORD']==token.lemma_.lower()]['CNC'])[0]
                token_dict['MRCP_concretness_ratings'] = rating
            else:
                token_dict['MRCP_concretness_ratings'] = None
                
            # MRCP Imagability ratings
            if token.text.lower() in list(MRCP_Imagability_ratings['WORD']):
                rating = list(MRCP_Imagability_ratings[MRCP_Imagability_ratings['WORD']==token.text.lower()]['IMG'])[0]
                token_dict['MRCP_Imagability_ratings'] = rating
            elif token.lemma_.lower() in list(MRCP_Imagability_ratings['WORD']):
                rating = list(MRCP_Imagability_ratings[MRCP_Imagability_ratings['WORD']==token.lemma_.lower()]['IMG'])[0]
                token_dict['MRCP_Imagability_ratings'] = rating
            else:
                token_dict['MRCP_Imagability_ratings'] = None

            # hyperbolic terms
            if token.text.lower() in hyperbolic_terms:
                token_dict['hyperbolic_terms'] = 1
            elif token.lemma_.lower() in hyperbolic_terms:
                token_dict['hyperbolic_terms'] = 1
            else:
                token_dict['hyperbolic_terms'] = 0

            # attitude markers
            if token.text.lower() in attitude_markers:
                token_dict['attitude_markers'] = 1
            elif token.lemma_.lower() in attitude_markers:
                token_dict['attitude_markers'] = 1
            else:
                token_dict['attitude_markers'] = 0
            
            # kill verbs + nouns
            if token.text.lower() in kill_verbs:
                token_dict['kill_verbs'] = 1
            elif token.lemma_.lower() in kill_verbs:
                token_dict['kill_verbs'] = 1
            else:
                token_dict['kill_verbs'] = 0
                
            # bias lexicon
            if token.text.lower() in bias_lexicon:
                token_dict['bias_lexicon'] = 1
            elif token.lemma_.lower() in bias_lexicon:
                token_dict['bias_lexicon'] = 1
            else:
                token_dict['bias_lexicon'] = 0
                
                
            # merge everything into the list of tokens and their properties
            sent_tokens.append(token_dict)
    
    
    # context
    for token in sent_tokens:
        # the 1st word in the sentence
        if token['order'] == 0:
            for feat in feats_C1: # context word -2
                token[feat]=None
            for feat in feats_C2: # context word -1
                token[feat]=None
            for i, feat in enumerate(feats_C3): # context word +1
                token[feat]=sent_tokens[token['order']+1][feats[i]]
            for i, feat in enumerate(feats_C4): # context word +2
                token[feat]=sent_tokens[token['order']+2][feats[i]]
        # the 2nd word in the sentence
        elif token['order'] == 1:
            for feat in feats_C1: # context word -2
                token[feat]=None
            for i, feat in enumerate(feats_C2): # context word -1
                token[feat]=sent_tokens[token['order']-1][feats[i]]
            for i, feat in enumerate(feats_C3): # context word +1
                token[feat]=sent_tokens[token['order']+1][feats[i]]
            for i, feat in enumerate(feats_C4): # context word +2
                token[feat]=sent_tokens[token['order']+2][feats[i]]
        # the pre-last word in the sentence
        elif token['order'] == (len(sent_tokens)-2):
            for i, feat in enumerate(feats_C1): # context word -2
                token[feat]=sent_tokens[token['order']-2][feats[i]]
            for i, feat in enumerate(feats_C2): # context word -1
                token[feat]=sent_tokens[token['order']-1][feats[i]]
            for i, feat in enumerate(feats_C3): # context word +1
                token[feat]=sent_tokens[token['order']+1][feats[i]]
            for i, feat in enumerate(feats_C4): # context word +2
                token[feat]=None
        # the last word in the sentence
        elif token['order'] == (len(sent_tokens)-1):
            for i, feat in enumerate(feats_C1): # context word -2
                token[feat]=sent_tokens[token['order']-2][feats[i]]
            for i, feat in enumerate(feats_C2): # context word -1
                token[feat]=sent_tokens[token['order']-1][feats[i]]
            for i, feat in enumerate(feats_C3): # context word +1
                token[feat]=None
            for i, feat in enumerate(feats_C4): # context word +2
                token[feat]=None
        # in other cases:
        else:
            for i, feat in enumerate(feats_C1): # context word -2
                token[feat]=sent_tokens[token['order']-2][feats[i]]
            for i, feat in enumerate(feats_C2): # context word -1
                token[feat]=sent_tokens[token['order']-1][feats[i]]
            for i, feat in enumerate(feats_C3): # context word +1
                token[feat]=sent_tokens[token['order']+1][feats[i]]
            for i, feat in enumerate(feats_C4): # context word +2
                token[feat]=sent_tokens[token['order']+2][feats[i]]            
        
    
    # update column spacy_lg_dict
    dt.at[index,'spacy_lg_dict'] = sent_tokens
    
end_time = time.time()
print('Time to create features for each word:', round((end_time-start_time),2), 'seconds')

Time to create features for each word: 1061.91 seconds
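The context step in the cell above copies every feature of the two preceding and the two following tokens onto the current token, using the suffixes `_C1` (position -2), `_C2` (-1), `_C3` (+1) and `_C4` (+2), and fills in `None` where no neighbour exists. The same idea can be written with a single bounds check instead of one branch per position. The following is a minimal sketch under that assumption; `add_context` is a hypothetical helper and the toy tokens carry only one feature.

```python
def add_context(sent_tokens, feats, offsets=(-2, -1, 1, 2)):
    """Copy the features of the neighbours at the given offsets onto every token."""
    suffixes = ["_C1", "_C2", "_C3", "_C4"]
    for pos, token in enumerate(sent_tokens):
        for suffix, offset in zip(suffixes, offsets):
            j = pos + offset
            # neighbours outside the sentence are represented by None
            neighbour = sent_tokens[j] if 0 <= j < len(sent_tokens) else None
            for feat in feats:
                token[feat + suffix] = neighbour[feat] if neighbour is not None else None
    return sent_tokens

# toy sentence with two kept tokens and a single feature
toy_tokens = [{"text": "a"}, {"text": "b"}]
add_context(toy_tokens, ["text"])
print(toy_tokens)
```

Unlike the branch-per-position version, this sketch also pads sentences with fewer than four kept tokens instead of indexing past the end of the token list.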


### 2.2 Handle dictionaries that include more than 1 token

In [11]:
# add windows of -4 and +4 words and lemmas
for index, row in dt.iterrows():
    print(index)
    print(len(row.spacy_lg_dict))
    if len(row.spacy_lg_dict) >= 8:
        for i, token in enumerate(row.spacy_lg_dict):
            print(i, token['text'])
            if i == 0:
                print('it chooes 0')
                token['window_text'] = [None, None, None, None, token['text_low'],
                                        row.spacy_lg_dict[i+1]['text_low'],row.spacy_lg_dict[i+2]['text_low'],
                                        row.spacy_lg_dict[i+3]['text_low'],row.spacy_lg_dict[i+4]['text_low']]
                token['window_lemma'] = [None, None, None, None, token['lemma_low'],
                                        row.spacy_lg_dict[i+1]['lemma_low'],row.spacy_lg_dict[i+2]['lemma_low'],
                                        row.spacy_lg_dict[i+3]['lemma_low'],row.spacy_lg_dict[i+4]['lemma_low']]
            elif i == 1:
                print('it chooes 1')
                token['window_text'] = [None, None, None, row.spacy_lg_dict[i-1]['text_low'], token['text_low'],
                                        row.spacy_lg_dict[i+1]['text_low'],row.spacy_lg_dict[i+2]['text_low'],
                                        row.spacy_lg_dict[i+3]['text_low'],row.spacy_lg_dict[i+4]['text_low']]
                token['window_lemma'] = [None, None, None, row.spacy_lg_dict[i-1]['lemma_low'], token['lemma_low'],
                                        row.spacy_lg_dict[i+1]['lemma_low'],row.spacy_lg_dict[i+2]['lemma_low'],
                                        row.spacy_lg_dict[i+3]['lemma_low'],row.spacy_lg_dict[i+4]['lemma_low']]
            elif i == 2:
                print('it chooes 2')
                token['window_text'] = [None, None, row.spacy_lg_dict[i-2]['text_low'],
                                        row.spacy_lg_dict[i-1]['text_low'],token['text_low'],
                                        row.spacy_lg_dict[i+1]['text_low'],row.spacy_lg_dict[i+2]['text_low'],
                                        row.spacy_lg_dict[i+3]['text_low'],row.spacy_lg_dict[i+4]['text_low']]
                token['window_lemma'] = [None, None, row.spacy_lg_dict[i-2]['lemma_low'],
                                         row.spacy_lg_dict[i-1]['lemma_low'],token['lemma_low'],
                                        row.spacy_lg_dict[i+1]['lemma_low'],row.spacy_lg_dict[i+2]['lemma_low'],
                                        row.spacy_lg_dict[i+3]['lemma_low'],row.spacy_lg_dict[i+4]['lemma_low']]
            elif i == 3:
                print('it chooes 3')
                token['window_text'] = [None, row.spacy_lg_dict[i-3]['text_low'], row.spacy_lg_dict[i-2]['text_low'],
                                        row.spacy_lg_dict[i-1]['text_low'],token['text_low'],
                                        row.spacy_lg_dict[i+1]['text_low'],row.spacy_lg_dict[i+2]['text_low'],
                                        row.spacy_lg_dict[i+3]['text_low'],row.spacy_lg_dict[i+4]['text_low']]
                token['window_lemma'] = [None, row.spacy_lg_dict[i-3]['lemma_low'], row.spacy_lg_dict[i-2]['lemma_low'],
                                         row.spacy_lg_dict[i-1]['lemma_low'],token['lemma_low'],
                                        row.spacy_lg_dict[i+1]['lemma_low'],row.spacy_lg_dict[i+2]['lemma_low'],
                                        row.spacy_lg_dict[i+3]['lemma_low'],row.spacy_lg_dict[i+4]['lemma_low']]
            elif i == (len(row.spacy_lg_dict)-4):
                print('it chooes -4')
                token['window_text'] = [row.spacy_lg_dict[i-4]['text_low'], row.spacy_lg_dict[i-3]['text_low'],
                                        row.spacy_lg_dict[i-2]['text_low'], row.spacy_lg_dict[i-1]['text_low'],
                                        token['text_low'],
                                        row.spacy_lg_dict[i+1]['text_low'],row.spacy_lg_dict[i+2]['text_low'],
                                        row.spacy_lg_dict[i+3]['text_low'],None]
                token['window_lemma'] = [row.spacy_lg_dict[i-4]['lemma_low'], row.spacy_lg_dict[i-3]['lemma_low'],
                                         row.spacy_lg_dict[i-2]['lemma_low'], row.spacy_lg_dict[i-1]['lemma_low'],
                                         token['lemma_low'],
                                        row.spacy_lg_dict[i+1]['lemma_low'],row.spacy_lg_dict[i+2]['lemma_low'],
                                        row.spacy_lg_dict[i+3]['lemma_low'],None]
            elif i == (len(row.spacy_lg_dict)-3):
                print('it chooes -3')
                token['window_text'] = [row.spacy_lg_dict[i-4]['text_low'], row.spacy_lg_dict[i-3]['text_low'],
                                        row.spacy_lg_dict[i-2]['text_low'], row.spacy_lg_dict[i-1]['text_low'],
                                        token['text_low'],
                                        row.spacy_lg_dict[i+1]['text_low'],row.spacy_lg_dict[i+2]['text_low'],
                                        None,None]
                token['window_lemma'] = [row.spacy_lg_dict[i-4]['lemma_low'], row.spacy_lg_dict[i-3]['lemma_low'],
                                         row.spacy_lg_dict[i-2]['lemma_low'], row.spacy_lg_dict[i-1]['lemma_low'],
                                         token['lemma_low'],
                                        row.spacy_lg_dict[i+1]['lemma_low'],row.spacy_lg_dict[i+2]['lemma_low'],
                                         None,None]
            elif i == (len(row.spacy_lg_dict)-2):
                print('it chooes -2')
                token['window_text'] = [row.spacy_lg_dict[i-4]['text_low'], row.spacy_lg_dict[i-3]['text_low'],
                                        row.spacy_lg_dict[i-2]['text_low'], row.spacy_lg_dict[i-1]['text_low'],
                                        token['text_low'],row.spacy_lg_dict[i+1]['text_low'],None,None,None]
                token['window_lemma'] = [row.spacy_lg_dict[i-4]['lemma_low'], row.spacy_lg_dict[i-3]['lemma_low'],
                                         row.spacy_lg_dict[i-2]['lemma_low'], row.spacy_lg_dict[i-1]['lemma_low'],
                                         token['lemma_low'],row.spacy_lg_dict[i+1]['lemma_low'],None,None,None]
            elif i == (len(row.spacy_lg_dict)-1):
                print('it chooes -1')
                token['window_text'] = [row.spacy_lg_dict[i-4]['text_low'], row.spacy_lg_dict[i-3]['text_low'],
                                        row.spacy_lg_dict[i-2]['text_low'], row.spacy_lg_dict[i-1]['text_low'],
                                        token['text_low'],None,None,None,None]
                token['window_lemma'] = [row.spacy_lg_dict[i-4]['lemma_low'], row.spacy_lg_dict[i-3]['lemma_low'],
                                         row.spacy_lg_dict[i-2]['lemma_low'], row.spacy_lg_dict[i-1]['lemma_low'],
                                         token['lemma_low'],None,None,None,None]
            else:
                print('it chooes else')
                token['window_text'] = [row.spacy_lg_dict[i-4]['text_low'], row.spacy_lg_dict[i-3]['text_low'],
                                        row.spacy_lg_dict[i-2]['text_low'], row.spacy_lg_dict[i-1]['text_low'],
                                        token['text_low'],
                                        row.spacy_lg_dict[i+1]['text_low'],row.spacy_lg_dict[i+2]['text_low'],
                                        row.spacy_lg_dict[i+3]['text_low'],row.spacy_lg_dict[i+4]['text_low']]
                token['window_lemma'] = [row.spacy_lg_dict[i-4]['lemma_low'], row.spacy_lg_dict[i-3]['lemma_low'],
                                         row.spacy_lg_dict[i-2]['lemma_low'], row.spacy_lg_dict[i-1]['lemma_low'],
                                         token['lemma_low'],
                                        row.spacy_lg_dict[i+1]['lemma_low'],row.spacy_lg_dict[i+2]['lemma_low'],
                                        row.spacy_lg_dict[i+3]['lemma_low'],row.spacy_lg_dict[i+4]['lemma_low']]

            print(token['window_text'])
        
    else:
        for i, token in enumerate(row.spacy_lg_dict):
            token['window_text'] = None
            token['window_lemma'] = None

Streaming output truncated to the last 5000 lines.
it chooses else
['majority', 'of', 'the', 'federal', 'government', 'public', 'lands', 'deleted', 'sexual']
31 public
it chooses else
['of', 'the', 'federal', 'government', 'public', 'lands', 'deleted', 'sexual', 'orientation']
32 lands
it chooses else
['the', 'federal', 'government', 'public', 'lands', 'deleted', 'sexual', 'orientation', 'from']
33 deleted
it chooses else
['federal', 'government', 'public', 'lands', 'deleted', 'sexual', 'orientation', 'from', 'its']
34 sexual
it chooses else
['government', 'public', 'lands', 'deleted', 'sexual', 'orientation', 'from', 'its', 'anti']
35 orientation
it chooses else
['public', 'lands', 'deleted', 'sexual', 'orientation', 'from', 'its', 'anti', '-']
36 from
it chooses else
['lands', 'deleted', 'sexual', 'orientation', 'from', 'its', 'anti', '-', 'discrimination']
37 its
it chooses else
['deleted', 'sexual', 'orientation', 'from', 'its', 'anti', '-', 'discrimination', 
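
The position-specific branches in the window-building cell can, in principle, be collapsed into a single padded slice. The following is a minimal sketch, not the authors' code: `build_window` and its `size` parameter are hypothetical names, and it assumes a window always spans four tokens on either side of the current token, with `None` in every slot that falls outside the sentence. It does not reproduce the separate case in which both windows are set to `None` for a whole row.

```python
def build_window(values, i, size=4):
    """Return values[i-size .. i+size]; slots outside the list become None."""
    return [values[j] if 0 <= j < len(values) else None
            for j in range(i - size, i + size + 1)]


# Toy sentence (lowercased tokens), used only for illustration.
tokens = ['the', 'federal', 'government', 'public', 'lands', 'deleted']

# Token at index 1: three left slots fall outside the sentence.
window = build_window(tokens, 1)
print(window)
# [None, None, None, 'the', 'federal', 'government', 'public', 'lands', 'deleted']
```

With such a helper, `window_text` and `window_lemma` could be built by passing the list of `text_low` values and the list of `lemma_low` values, respectively.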

In [12]:
# lexicon features for the token itself (window position 4)
for index, row in dt.iterrows():
    print(index)
    for i, token in enumerate(row.spacy_lg_dict):
        print(i, token['text'])
        if token['window_text'] != None and token['window_lemma'] != None:
            window_text = token['window_text']
            window_lemma = token['window_lemma']

            # words
            c1, c2, c_m, c3, c4 = None, None, None, None, None
            if window_text[2] != None and window_text[3] != None:
                c1 = window_text[2] + ' ' + window_text[3] + ' ' + token['text_low']
            if window_text[3] != None:
                c2 = window_text[3] + ' ' + token['text_low']
            if window_text[3] != None and window_text[5] != None:
                c_m = window_text[3] + ' ' + token['text_low'] + ' ' + window_text[5]
            if window_text[5] != None:
                c3 = token['text_low'] + ' ' + window_text[5]
            if window_text[5] != None and window_text[6] != None:
                c4 = token['text_low'] + ' ' + window_text[5] + ' ' + window_text[6]

            # lemmas
            c1l, c2l, c_ml, c3l, c4l = None, None, None, None, None
            if window_lemma[2] != None and window_lemma[3] != None:
                c1l = window_lemma[2] + ' ' + window_lemma[3] + ' ' + token['lemma_low']
            if window_lemma[3] != None:
                c2l = window_lemma[3] + ' ' + token['lemma_low']
            if window_lemma[3] != None and window_lemma[5] != None:
                c_ml = window_lemma[3] + ' ' + token['lemma_low'] + ' ' + window_lemma[5]
            if window_lemma[5] != None:
                c3l = token['lemma_low'] + ' ' + window_lemma[5]
            if window_lemma[5] != None and window_lemma[6] != None:
                c4l = token['lemma_low'] + ' ' + window_lemma[5] + ' ' + window_lemma[6]

            # assertive verbs
            if token['text_low'] in assertive_verbs or c1 in assertive_verbs or c2 in assertive_verbs or \
            c_m in assertive_verbs or c3 in assertive_verbs or c4 in assertive_verbs or \
            token['lemma_low'] in assertive_verbs or c1l in assertive_verbs or c2l in assertive_verbs or \
            c_ml in assertive_verbs or c3l in assertive_verbs or c4l in assertive_verbs:
                token['assertive_verbs'] = 1
            else:
                token['assertive_verbs'] = 0   

            # factive verbs
            if token['text_low'] in factive_verbs or c1 in factive_verbs or c2 in factive_verbs or \
            c_m in factive_verbs or c3 in factive_verbs or c4 in factive_verbs or \
            token['lemma_low'] in factive_verbs or c1l in factive_verbs or c2l in factive_verbs or \
            c_ml in factive_verbs or c3l in factive_verbs or c4l in factive_verbs:
                token['factive_verbs'] = 1
            else:
                token['factive_verbs'] = 0   

            # report verbs
            if token['text_low'] in report_verbs or c1 in report_verbs or c2 in report_verbs or \
            c_m in report_verbs or c3 in report_verbs or c4 in report_verbs or \
            token['lemma_low'] in report_verbs or c1l in report_verbs or c2l in report_verbs or \
            c_ml in report_verbs or c3l in report_verbs or c4l in report_verbs:
                token['report_verbs'] = 1
            else:
                token['report_verbs'] = 0

            # hedges
            if token['text_low'] in hedges or c1 in hedges or c2 in hedges or \
            c_m in hedges or c3 in hedges or c4 in hedges or \
            token['lemma_low'] in hedges or c1l in hedges or c2l in hedges or \
            c_ml in hedges or c3l in hedges or c4l in hedges:
                token['hedges'] = 1
            else:
                token['hedges'] = 0

            # boosters
            if token['text_low'] in boosters or c1 in boosters or c2 in boosters or \
            c_m in boosters or c3 in boosters or c4 in boosters or \
            token['lemma_low'] in boosters or c1l in boosters or c2l in boosters or \
            c_ml in boosters or c3l in boosters or c4l in boosters:
                token['boosters'] = 1
            else:
                token['boosters'] = 0
                
            # implicative verbs
            if token['text_low'] in implicative_verbs or c1 in implicative_verbs or c2 in implicative_verbs or \
            c_m in implicative_verbs or c3 in implicative_verbs or c4 in implicative_verbs or \
            token['lemma_low'] in implicative_verbs or c1l in implicative_verbs or c2l in implicative_verbs or \
            c_ml in implicative_verbs or c3l in implicative_verbs or c4l in implicative_verbs:
                token['implicative_verbs'] = 1
            else:
                token['implicative_verbs'] = 0
            
        
        else:
            # assertive verbs
            if token['text_low'] in assertive_verbs or token['lemma_low'] in assertive_verbs:
                token['assertive_verbs'] = 1
            else:
                token['assertive_verbs'] = 0 

            # factive verbs
            if token['text_low'] in factive_verbs or token['lemma_low'] in factive_verbs:
                token['factive_verbs'] = 1
            else:
                token['factive_verbs'] = 0   

            # report verbs
            if token['text_low'] in report_verbs or token['lemma_low'] in report_verbs:
                token['report_verbs'] = 1
            else:
                token['report_verbs'] = 0

            # hedges
            if token['text_low'] in hedges or token['lemma_low'] in hedges:
                token['hedges'] = 1
            else:
                token['hedges'] = 0

            # boosters
            if token['text_low'] in boosters or token['lemma_low'] in boosters:
                token['boosters'] = 1
            else:
                token['boosters'] = 0
                
            # implicative verbs
            if token['text_low'] in implicative_verbs or token['lemma_low'] in implicative_verbs:
                token['implicative_verbs'] = 1
            else:
                token['implicative_verbs'] = 0


Streaming output truncated to the last 5000 lines.
10 cause
11 autism
12 as
13 he
14 did
15 during
16 the
17 presidential
18 debates
19 many
20 parents
21 may
22 choose
23 vaccinate
24 putting
25 children
26 at
27 unnecessary
28 risk
3523
0 When
1 a
2 newly
3 organized
4 vaccine
5 research
6 group
7 at
8 the
9 U.S.
10 National
11 Institutes
12 of
13 Health
14 NIH
15 met
16 for
17 the
18 first
19 time
20 this
21 week
22 its
23 members
24 had
25 expected
26 be
27 able
28 ease
29 into
30 their
31 work
32 their
33 mandate
34 is
35 conduct
36 human
37 trials
38 for
39 emerging
40 health
41 threats
42 their
43 first
44 assignment
45 came
46 at
47 shocking
48 speed
3524
0 When
1 a
2 supporter
3 told
4 Warren
5 public
6 schools
7 need
8 teach
9 more
10 about
11 LGBTQ
12 history
13 sex
14 education
15 the
16 Massachusetts
17 senator
18 replied
19 her
20 education
21 secretary
22 would
23 have
24 be
25 interviewed
26 by
27 a
28 transgender
29 child
3525
0 When
1 car
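
The six lexicon checks in the cell above all test the same twelve candidates: the token, two left n-grams, one centered trigram and two right n-grams, each in word and lemma form. A possible way to express this once is sketched below. The helper names `ngram_candidates` and `in_lexicon` are hypothetical, the lexicon is a toy set rather than one of the dictionaries used in the notebook, and the sketch assumes the token of interest sits at index `center` of a nine-slot window.

```python
def ngram_candidates(window, center):
    """Return the token at `center` plus the surrounding 2- and 3-grams.

    Any n-gram that would touch a None slot or leave the window is skipped.
    """
    spans = [
        (center, center),          # token itself
        (center - 2, center),      # c1: two tokens to the left + token
        (center - 1, center),      # c2: one token to the left + token
        (center - 1, center + 1),  # c_m: token in the middle
        (center, center + 1),      # c3: token + one token to the right
        (center, center + 2),      # c4: token + two tokens to the right
    ]
    candidates = []
    for start, end in spans:
        if start < 0 or end >= len(window):
            continue
        parts = window[start:end + 1]
        if all(p is not None for p in parts):
            candidates.append(' '.join(parts))
    return candidates


def in_lexicon(lexicon, window_text, window_lemma, center=4):
    """Return 1 if any word or lemma candidate is in the lexicon, else 0."""
    candidates = (ngram_candidates(window_text, center)
                  + ngram_candidates(window_lemma, center))
    return int(any(c in lexicon for c in candidates))


# Toy data, for illustration only.
toy_lexicon = {'point out', 'sort of'}
window_text = [None, None, 'critics', 'often', 'point', 'out', 'that', 'costs', 'rise']
window_lemma = [None, None, 'critic', 'often', 'point', 'out', 'that', 'cost', 'rise']

print(in_lexicon(toy_lexicon, window_text, window_lemma, center=4))  # token 'point'
print(in_lexicon(toy_lexicon, window_text, window_lemma, center=2))  # token 'critics'
```

Passing a different `center` (for example 2, 3, 5 or 6) would cover the context positions, provided the context token is the window entry at that index.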

In [13]:
# token in context -2 (c1)
for index, row in dt.iterrows():
    print(index)
    for i, token in enumerate(row.spacy_lg_dict):
        print(i, token['text_low_C1'])
        if token['text_low_C1'] == None:
            token['assertive_verbs_C1'] = 0
            token['factive_verbs_C1'] = 0
            token['report_verbs_C1'] = 0
            token['hedges_C1'] = 0
            token['boosters_C1'] = 0
            token['implicative_verbs_C1'] = 0
        else:
            if token['window_text'] != None and token['window_lemma'] != None:
                window_text = token['window_text']
                window_lemma = token['window_lemma']

                # window positions: 0 1 <2> 3 4 5 6 7 8 (the C1 token sits at index 2)
                # token['text_low_C1']
                # token['lemma_low_C1']
                # words
                c1, c2, c_m, c3, c4 = None, None, None, None, None
                if window_text[0] != None and window_text[1] != None:
                    c1 = window_text[0] + ' ' + window_text[1] + ' ' + token['text_low_C1']
                if window_text[1] != None:
                    c2 = window_text[1] + ' ' + token['text_low_C1']
                if window_text[1] != None and window_text[3] != None:
                    c_m = window_text[1] + ' ' + token['text_low_C1'] + ' ' + window_text[3]
                if window_text[3] != None:
                    c3 = token['text_low_C1'] + ' ' + window_text[3]
                if window_text[3] != None and window_text[4] != None:
                    c4 = token['text_low_C1'] + ' ' + window_text[3] + ' ' + window_text[4]

                # lemmas
                c1l, c2l, c_ml, c3l, c4l = None, None, None, None, None
                if window_lemma[0] != None and window_lemma[1] != None:
                    c1l = window_lemma[0] + ' ' + window_lemma[1] + ' ' + token['lemma_low_C1']
                if window_lemma[1] != None:
                    c2l = window_lemma[1] + ' ' + token['lemma_low_C1']
                if window_lemma[1] != None and window_lemma[3] != None:
                    c_ml = window_lemma[1] + ' ' + token['lemma_low_C1'] + ' ' + window_lemma[3]
                if window_lemma[3] != None:
                    c3l = token['lemma_low_C1'] + ' ' + window_lemma[3]
                if window_lemma[3] != None and window_lemma[4] != None:
                    c4l = token['lemma_low_C1'] + ' ' + window_lemma[3] + ' ' + window_lemma[4]

                # assertive verbs
                if token['text_low_C1'] in assertive_verbs or c1 in assertive_verbs or c2 in assertive_verbs or \
                c_m in assertive_verbs or c3 in assertive_verbs or c4 in assertive_verbs or \
                token['lemma_low_C1'] in assertive_verbs or c1l in assertive_verbs or c2l in assertive_verbs or \
                c_ml in assertive_verbs or c3l in assertive_verbs or c4l in assertive_verbs:
                    token['assertive_verbs_C1'] = 1
                else:
                    token['assertive_verbs_C1'] = 0   

                # factive verbs
                if token['text_low_C1'] in factive_verbs or c1 in factive_verbs or c2 in factive_verbs or \
                c_m in factive_verbs or c3 in factive_verbs or c4 in factive_verbs or \
                token['lemma_low_C1'] in factive_verbs or c1l in factive_verbs or c2l in factive_verbs or \
                c_ml in factive_verbs or c3l in factive_verbs or c4l in factive_verbs:
                    token['factive_verbs_C1'] = 1
                else:
                    token['factive_verbs_C1'] = 0   

                # report verbs
                if token['text_low_C1'] in report_verbs or c1 in report_verbs or c2 in report_verbs or \
                c_m in report_verbs or c3 in report_verbs or c4 in report_verbs or \
                token['lemma_low_C1'] in report_verbs or c1l in report_verbs or c2l in report_verbs or \
                c_ml in report_verbs or c3l in report_verbs or c4l in report_verbs:
                    token['report_verbs_C1'] = 1
                else:
                    token['report_verbs_C1'] = 0

                # hedges
                if token['text_low_C1'] in hedges or c1 in hedges or c2 in hedges or \
                c_m in hedges or c3 in hedges or c4 in hedges or \
                token['lemma_low_C1'] in hedges or c1l in hedges or c2l in hedges or \
                c_ml in hedges or c3l in hedges or c4l in hedges:
                    token['hedges_C1'] = 1
                else:
                    token['hedges_C1'] = 0

                # boosters
                if token['text_low_C1'] in boosters or c1 in boosters or c2 in boosters or \
                c_m in boosters or c3 in boosters or c4 in boosters or \
                token['lemma_low_C1'] in boosters or c1l in boosters or c2l in boosters or \
                c_ml in boosters or c3l in boosters or c4l in boosters:
                    token['boosters_C1'] = 1
                else:
                    token['boosters_C1'] = 0
                    
                # implicative verbs
                if token['text_low_C1'] in implicative_verbs or c1 in implicative_verbs or c2 in implicative_verbs or \
                c_m in implicative_verbs or c3 in implicative_verbs or c4 in implicative_verbs or \
                token['lemma_low_C1'] in implicative_verbs or c1l in implicative_verbs or c2l in implicative_verbs or \
                c_ml in implicative_verbs or c3l in implicative_verbs or c4l in implicative_verbs:
                    token['implicative_verbs_C1'] = 1
                else:
                    token['implicative_verbs_C1'] = 0

            else:
                # assertive verbs
                if token['text_low_C1'] in assertive_verbs or token['lemma_low_C1'] in assertive_verbs:
                    token['assertive_verbs_C1'] = 1
                else:
                    token['assertive_verbs_C1'] = 0 

                # factive verbs
                if token['text_low_C1'] in factive_verbs or token['lemma_low_C1'] in factive_verbs:
                    token['factive_verbs_C1'] = 1
                else:
                    token['factive_verbs_C1'] = 0   

                # report verbs
                if token['text_low_C1'] in report_verbs or token['lemma_low_C1'] in report_verbs:
                    token['report_verbs_C1'] = 1
                else:
                    token['report_verbs_C1'] = 0

                # hedges
                if token['text_low_C1'] in hedges or token['lemma_low_C1'] in hedges:
                    token['hedges_C1'] = 1
                else:
                    token['hedges_C1'] = 0

                # boosters
                if token['text_low_C1'] in boosters or token['lemma_low_C1'] in boosters:
                    token['boosters_C1'] = 1
                else:
                    token['boosters_C1'] = 0
                    
                # implicative verbs
                if token['text_low_C1'] in implicative_verbs or token['lemma_low_C1'] in implicative_verbs:
                    token['implicative_verbs_C1'] = 1
                else:
                    token['implicative_verbs_C1'] = 0

Streaming output truncated to the last 5000 lines.
10 that
11 vaccines
12 cause
13 autism
14 as
15 he
16 did
17 during
18 the
19 presidential
20 debates
21 many
22 parents
23 may
24 choose
25 vaccinate
26 putting
27 children
28 at
3523
0 None
1 None
2 when
3 a
4 newly
5 organized
6 vaccine
7 research
8 group
9 at
10 the
11 u.s.
12 national
13 institutes
14 of
15 health
16 nih
17 met
18 for
19 the
20 first
21 time
22 this
23 week
24 its
25 members
26 had
27 expected
28 be
29 able
30 ease
31 into
32 their
33 work
34 their
35 mandate
36 is
37 conduct
38 human
39 trials
40 for
41 emerging
42 health
43 threats
44 their
45 first
46 assignment
47 came
48 at
3524
0 None
1 None
2 when
3 a
4 supporter
5 told
6 warren
7 public
8 schools
9 need
10 teach
11 more
12 about
13 lgbtq
14 history
15 sex
16 education
17 the
18 massachusetts
19 senator
20 replied
21 her
22 education
23 secretary
24 would
25 have
26 be
27 interviewed
28 by
29 a
3525
0 None
1 None
2 when
3 carry
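
The context cells (C1 to C4) differ from each other only in the key suffix and in which lexicon flag they write. The sketch below shows one way the single-word fallback branch and the `None` branch could be driven by a loop over lexicons. It is an illustration under assumptions: `flag_token` is a hypothetical helper, the lexicons are toy sets, and the n-gram checks of the full branch are deliberately left out.

```python
def flag_token(token, lexicons, suffix=''):
    """Set one 0/1 flag per lexicon for the word/lemma stored under `suffix`.

    A missing context token (value None) gets 0 for every lexicon.
    """
    text = token.get('text_low' + suffix)
    lemma = token.get('lemma_low' + suffix)
    for name, lexicon in lexicons.items():
        if text is None:
            token[name + suffix] = 0
        else:
            token[name + suffix] = int(text in lexicon or lemma in lexicon)
    return token


# Toy lexicons, for illustration only.
toy_lexicons = {
    'assertive_verbs': {'claim'},
    'hedges': {'may'},
}

token = {
    'text_low_C1': 'claims', 'lemma_low_C1': 'claim',  # context token two to the left
    'text_low_C2': None, 'lemma_low_C2': None,         # no context token available
}

for suffix in ('_C1', '_C2'):
    flag_token(token, toy_lexicons, suffix)

print(token['assertive_verbs_C1'], token['hedges_C1'])
print(token['assertive_verbs_C2'], token['hedges_C2'])
```

Keeping the lexicons in one dictionary means a new word list only needs a new entry, not another copy of the branch.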

In [14]:
# token in context -1 (c2)
for index, row in dt.iterrows():
    print(index)
    for i, token in enumerate(row.spacy_lg_dict):
        print(i, token['text_low_C2'])
        if token['text_low_C2'] == None:
            token['assertive_verbs_C2'] = 0
            token['factive_verbs_C2'] = 0
            token['report_verbs_C2'] = 0
            token['hedges_C2'] = 0
            token['boosters_C2'] = 0
            token['implicative_verbs_C2'] = 0
        else:
            if token['window_text'] != None and token['window_lemma'] != None:
                window_text = token['window_text']
                window_lemma = token['window_lemma']

                # window positions: 0 1 2 <3> 4 5 6 7 8 (the C2 token sits at index 3)
                # token['text_low_C2']
                # token['lemma_low_C2']
                # words
                c1, c2, c_m, c3, c4 = None, None, None, None, None
                if window_text[1] != None and window_text[2] != None:
                    c1 = window_text[1] + ' ' + window_text[2] + ' ' + token['text_low_C2']
                if window_text[2] != None:
                    c2 = window_text[2] + ' ' + token['text_low_C2']
                if window_text[2] != None and window_text[4] != None:
                    c_m = window_text[2] + ' ' + token['text_low_C2'] + ' ' + window_text[4]
                if window_text[4] != None:
                    c3 = token['text_low_C2'] + ' ' + window_text[4]
                if window_text[4] != None and window_text[5] != None:
                    c4 = token['text_low_C2'] + ' ' + window_text[4] + ' ' + window_text[5]

                # lemmas
                c1l, c2l, c_ml, c3l, c4l = None, None, None, None, None
                if window_lemma[1] != None and window_lemma[2] != None:
                    c1l = window_lemma[1] + ' ' + window_lemma[2] + ' ' + token['lemma_low_C2']
                if window_lemma[2] != None:
                    c2l = window_lemma[2] + ' ' + token['lemma_low_C2']
                if window_lemma[2] != None and window_lemma[4] != None:
                    c_ml = window_lemma[2] + ' ' + token['lemma_low_C2'] + ' ' + window_lemma[4]
                if window_lemma[4] != None:
                    c3l = token['lemma_low_C2'] + ' ' + window_lemma[4]
                if window_lemma[4] != None and window_lemma[5] != None:
                    c4l = token['lemma_low_C2'] + ' ' + window_lemma[4] + ' ' + window_lemma[5]

                # assertive verbs
                if token['text_low_C2'] in assertive_verbs or c1 in assertive_verbs or c2 in assertive_verbs or \
                c_m in assertive_verbs or c3 in assertive_verbs or c4 in assertive_verbs or \
                token['lemma_low_C2'] in assertive_verbs or c1l in assertive_verbs or c2l in assertive_verbs or \
                c_ml in assertive_verbs or c3l in assertive_verbs or c4l in assertive_verbs:
                    token['assertive_verbs_C2'] = 1
                else:
                    token['assertive_verbs_C2'] = 0   

                # factive verbs
                if token['text_low_C2'] in factive_verbs or c1 in factive_verbs or c2 in factive_verbs or \
                c_m in factive_verbs or c3 in factive_verbs or c4 in factive_verbs or \
                token['lemma_low_C2'] in factive_verbs or c1l in factive_verbs or c2l in factive_verbs or \
                c_ml in factive_verbs or c3l in factive_verbs or c4l in factive_verbs:
                    token['factive_verbs_C2'] = 1
                else:
                    token['factive_verbs_C2'] = 0   

                # report verbs
                if token['text_low_C2'] in report_verbs or c1 in report_verbs or c2 in report_verbs or \
                c_m in report_verbs or c3 in report_verbs or c4 in report_verbs or \
                token['lemma_low_C2'] in report_verbs or c1l in report_verbs or c2l in report_verbs or \
                c_ml in report_verbs or c3l in report_verbs or c4l in report_verbs:
                    token['report_verbs_C2'] = 1
                else:
                    token['report_verbs_C2'] = 0

                # hedges
                if token['text_low_C2'] in hedges or c1 in hedges or c2 in hedges or \
                c_m in hedges or c3 in hedges or c4 in hedges or \
                token['lemma_low_C2'] in hedges or c1l in hedges or c2l in hedges or \
                c_ml in hedges or c3l in hedges or c4l in hedges:
                    token['hedges_C2'] = 1
                else:
                    token['hedges_C2'] = 0

                # boosters
                if token['text_low_C2'] in boosters or c1 in boosters or c2 in boosters or \
                c_m in boosters or c3 in boosters or c4 in boosters or \
                token['lemma_low_C2'] in boosters or c1l in boosters or c2l in boosters or \
                c_ml in boosters or c3l in boosters or c4l in boosters:
                    token['boosters_C2'] = 1
                else:
                    token['boosters_C2'] = 0
                    
                # implicative verbs
                if token['text_low_C2'] in implicative_verbs or c1 in implicative_verbs or c2 in implicative_verbs or \
                c_m in implicative_verbs or c3 in implicative_verbs or c4 in implicative_verbs or \
                token['lemma_low_C2'] in implicative_verbs or c1l in implicative_verbs or c2l in implicative_verbs or \
                c_ml in implicative_verbs or c3l in implicative_verbs or c4l in implicative_verbs:
                    token['implicative_verbs_C2'] = 1
                else:
                    token['implicative_verbs_C2'] = 0

            else:
                # assertive verbs
                if token['text_low_C2'] in assertive_verbs or token['lemma_low_C2'] in assertive_verbs:
                    token['assertive_verbs_C2'] = 1
                else:
                    token['assertive_verbs_C2'] = 0 

                # factive verbs
                if token['text_low_C2'] in factive_verbs or token['lemma_low_C2'] in factive_verbs:
                    token['factive_verbs_C2'] = 1
                else:
                    token['factive_verbs_C2'] = 0   

                # report verbs
                if token['text_low_C2'] in report_verbs or token['lemma_low_C2'] in report_verbs:
                    token['report_verbs_C2'] = 1
                else:
                    token['report_verbs_C2'] = 0

                # hedges
                if token['text_low_C2'] in hedges or token['lemma_low_C2'] in hedges:
                    token['hedges_C2'] = 1
                else:
                    token['hedges_C2'] = 0

                # boosters
                if token['text_low_C2'] in boosters or token['lemma_low_C2'] in boosters:
                    token['boosters_C2'] = 1
                else:
                    token['boosters_C2'] = 0
                    
                # implicative verbs
                if token['text_low_C2'] in implicative_verbs or token['lemma_low_C2'] in implicative_verbs:
                    token['implicative_verbs_C2'] = 1
                else:
                    token['implicative_verbs_C2'] = 0

Streaming output truncated to the last 5000 lines.
10 vaccines
11 cause
12 autism
13 as
14 he
15 did
16 during
17 the
18 presidential
19 debates
20 many
21 parents
22 may
23 choose
24 vaccinate
25 putting
26 children
27 at
28 unnecessary
3523
0 None
1 when
2 a
3 newly
4 organized
5 vaccine
6 research
7 group
8 at
9 the
10 u.s.
11 national
12 institutes
13 of
14 health
15 nih
16 met
17 for
18 the
19 first
20 time
21 this
22 week
23 its
24 members
25 had
26 expected
27 be
28 able
29 ease
30 into
31 their
32 work
33 their
34 mandate
35 is
36 conduct
37 human
38 trials
39 for
40 emerging
41 health
42 threats
43 their
44 first
45 assignment
46 came
47 at
48 shocking
3524
0 None
1 when
2 a
3 supporter
4 told
5 warren
6 public
7 schools
8 need
9 teach
10 more
11 about
12 lgbtq
13 history
14 sex
15 education
16 the
17 massachusetts
18 senator
19 replied
20 her
21 education
22 secretary
23 would
24 have
25 be
26 interviewed
27 by
28 a
29 transgender
3525
0 None
1 w

In [15]:
# token in context +1 (c3)
for index, row in dt.iterrows():
    print(index)
    for i, token in enumerate(row.spacy_lg_dict):
        print(i, token['text_low_C3'])
        if token['text_low_C3'] == None:
            token['assertive_verbs_C3'] = 0
            token['factive_verbs_C3'] = 0
            token['report_verbs_C3'] = 0
            token['hedges_C3'] = 0
            token['boosters_C3'] = 0
            token['implicative_verbs_C3'] = 0
        else:
            if token['window_text'] != None and token['window_lemma'] != None:
                window_text = token['window_text']
                window_lemma = token['window_lemma']

                # window positions: 0 1 2 3 4 <5> 6 7 8 (the C3 token sits at index 5)
                # token['text_low_C3']
                # token['lemma_low_C3']
                # words
                c1, c2, c_m, c3, c4 = None, None, None, None, None
                if window_text[3] != None and window_text[4] != None:
                    c1 = window_text[3] + ' ' + window_text[4] + ' ' + token['text_low_C3']
                if window_text[4] != None:
                    c2 = window_text[4] + ' ' + token['text_low_C3']
                if window_text[4] != None and window_text[6] != None:
                    c_m = window_text[4] + ' ' + token['text_low_C3'] + ' ' + window_text[6]
                if window_text[6] != None:
                    c3 = token['text_low_C3'] + ' ' + window_text[6]
                if window_text[6] != None and window_text[7] != None:
                    c4 = token['text_low_C3'] + ' ' + window_text[6] + ' ' + window_text[7]

                # lemmas
                c1l, c2l, c_ml, c3l, c4l = None, None, None, None, None
                if window_lemma[3] != None and window_lemma[4] != None:
                    c1l = window_lemma[3] + ' ' + window_lemma[4] + ' ' + token['lemma_low_C3']
                if window_lemma[4] != None:
                    c2l = window_lemma[4] + ' ' + token['lemma_low_C3']
                if window_lemma[4] != None and window_lemma[6] != None:
                    c_ml = window_lemma[4] + ' ' + token['lemma_low_C3'] + ' ' + window_lemma[6]
                if window_lemma[6] != None:
                    c3l = token['lemma_low_C3'] + ' ' + window_lemma[6]
                if window_lemma[6] != None and window_lemma[7] != None:
                    c4l = token['lemma_low_C3'] + ' ' + window_lemma[6] + ' ' + window_lemma[7]

                # assertive verbs
                if token['text_low_C3'] in assertive_verbs or c1 in assertive_verbs or c2 in assertive_verbs or \
                c_m in assertive_verbs or c3 in assertive_verbs or c4 in assertive_verbs or \
                token['lemma_low_C3'] in assertive_verbs or c1l in assertive_verbs or c2l in assertive_verbs or \
                c_ml in assertive_verbs or c3l in assertive_verbs or c4l in assertive_verbs:
                    token['assertive_verbs_C3'] = 1
                else:
                    token['assertive_verbs_C3'] = 0   

                # factive verbs
                if token['text_low_C3'] in factive_verbs or c1 in factive_verbs or c2 in factive_verbs or \
                c_m in factive_verbs or c3 in factive_verbs or c4 in factive_verbs or \
                token['lemma_low_C3'] in factive_verbs or c1l in factive_verbs or c2l in factive_verbs or \
                c_ml in factive_verbs or c3l in factive_verbs or c4l in factive_verbs:
                    token['factive_verbs_C3'] = 1
                else:
                    token['factive_verbs_C3'] = 0   

                # report verbs
                if token['text_low_C3'] in report_verbs or c1 in report_verbs or c2 in report_verbs or \
                c_m in report_verbs or c3 in report_verbs or c4 in report_verbs or \
                token['lemma_low_C3'] in report_verbs or c1l in report_verbs or c2l in report_verbs or \
                c_ml in report_verbs or c3l in report_verbs or c4l in report_verbs:
                    token['report_verbs_C3'] = 1
                else:
                    token['report_verbs_C3'] = 0

                # hedges
                if token['text_low_C3'] in hedges or c1 in hedges or c2 in hedges or \
                c_m in hedges or c3 in hedges or c4 in hedges or \
                token['lemma_low_C3'] in hedges or c1l in hedges or c2l in hedges or \
                c_ml in hedges or c3l in hedges or c4l in hedges:
                    token['hedges_C3'] = 1
                else:
                    token['hedges_C3'] = 0

                # boosters
                if token['text_low_C3'] in boosters or c1 in boosters or c2 in boosters or \
                c_m in boosters or c3 in boosters or c4 in boosters or \
                token['lemma_low_C3'] in boosters or c1l in boosters or c2l in boosters or \
                c_ml in boosters or c3l in boosters or c4l in boosters:
                    token['boosters_C3'] = 1
                else:
                    token['boosters_C3'] = 0
                    
                # implicative verbs
                if token['text_low_C3'] in implicative_verbs or c1 in implicative_verbs or c2 in implicative_verbs or \
                c_m in implicative_verbs or c3 in implicative_verbs or c4 in implicative_verbs or \
                token['lemma_low_C3'] in implicative_verbs or c1l in implicative_verbs or c2l in implicative_verbs or \
                c_ml in implicative_verbs or c3l in implicative_verbs or c4l in implicative_verbs:
                    token['implicative_verbs_C3'] = 1
                else:
                    token['implicative_verbs_C3'] = 0

            else:
                # assertive verbs
                if token['text_low_C3'] in assertive_verbs or token['lemma_low_C3'] in assertive_verbs:
                    token['assertive_verbs_C3'] = 1
                else:
                    token['assertive_verbs_C3'] = 0 

                # factive verbs
                if token['text_low_C3'] in factive_verbs or token['lemma_low_C3'] in factive_verbs:
                    token['factive_verbs_C3'] = 1
                else:
                    token['factive_verbs_C3'] = 0   

                # report verbs
                if token['text_low_C3'] in report_verbs or token['lemma_low_C3'] in report_verbs:
                    token['report_verbs_C3'] = 1
                else:
                    token['report_verbs_C3'] = 0

                # hedges
                if token['text_low_C3'] in hedges or token['lemma_low_C3'] in hedges:
                    token['hedges_C3'] = 1
                else:
                    token['hedges_C3'] = 0

                # boosters
                if token['text_low_C3'] in boosters or token['lemma_low_C3'] in boosters:
                    token['boosters_C3'] = 1
                else:
                    token['boosters_C3'] = 0
                    
                # implicative verbs
                if token['text_low_C3'] in implicative_verbs or token['lemma_low_C3'] in implicative_verbs:
                    token['implicative_verbs_C3'] = 1
                else:
                    token['implicative_verbs_C3'] = 0

[Output truncated: the last 5000 lines of the streaming output were cut off.]

In [16]:
# token in context +2 (c4)
for index, row in dt.iterrows():
    print(index)
    for i, token in enumerate(row.spacy_lg_dict):
        print(i, token['text_low_C4'])
        if token['text_low_C4'] is None:
            token['assertive_verbs_C4'] = 0
            token['factive_verbs_C4'] = 0
            token['report_verbs_C4'] = 0
            token['hedges_C4'] = 0
            token['boosters_C4'] = 0
            token['implicative_verbs_C4'] = 0
        else:
            if token['window_text'] is not None and token['window_lemma'] is not None:
                window_text = token['window_text']
                window_lemma = token['window_lemma']

                # window positions: 0 1 2 3 4 5 <6> 7 8
                # <6> marks the C4 token (token['text_low_C4'] / token['lemma_low_C4'])

                # words: bi- and trigrams that contain the C4 token
                c1, c2, c_m, c3, c4 = None, None, None, None, None
                if window_text[4] is not None and window_text[5] is not None:
                    c1 = window_text[4] + ' ' + window_text[5] + ' ' + token['text_low_C4']
                if window_text[5] is not None:
                    c2 = window_text[5] + ' ' + token['text_low_C4']
                if window_text[5] is not None and window_text[7] is not None:
                    c_m = window_text[5] + ' ' + token['text_low_C4'] + ' ' + window_text[7]
                if window_text[7] is not None:
                    c3 = token['text_low_C4'] + ' ' + window_text[7]
                if window_text[7] is not None and window_text[8] is not None:
                    c4 = token['text_low_C4'] + ' ' + window_text[7] + ' ' + window_text[8]

                # lemmas: the same n-grams built from lemmas
                c1l, c2l, c_ml, c3l, c4l = None, None, None, None, None
                if window_lemma[4] is not None and window_lemma[5] is not None:
                    c1l = window_lemma[4] + ' ' + window_lemma[5] + ' ' + token['lemma_low_C4']
                if window_lemma[5] is not None:
                    c2l = window_lemma[5] + ' ' + token['lemma_low_C4']
                if window_lemma[5] is not None and window_lemma[7] is not None:
                    c_ml = window_lemma[5] + ' ' + token['lemma_low_C4'] + ' ' + window_lemma[7]
                if window_lemma[7] is not None:
                    c3l = token['lemma_low_C4'] + ' ' + window_lemma[7]
                # check window_lemma[8] (not window_text[8]) before concatenating it
                if window_lemma[7] is not None and window_lemma[8] is not None:
                    c4l = token['lemma_low_C4'] + ' ' + window_lemma[7] + ' ' + window_lemma[8]

                # assertive verbs
                if token['text_low_C4'] in assertive_verbs or c1 in assertive_verbs or c2 in assertive_verbs or \
                c_m in assertive_verbs or c3 in assertive_verbs or c4 in assertive_verbs or \
                token['lemma_low_C4'] in assertive_verbs or c1l in assertive_verbs or c2l in assertive_verbs or \
                c_ml in assertive_verbs or c3l in assertive_verbs or c4l in assertive_verbs:
                    token['assertive_verbs_C4'] = 1
                else:
                    token['assertive_verbs_C4'] = 0   

                # factive verbs
                if token['text_low_C4'] in factive_verbs or c1 in factive_verbs or c2 in factive_verbs or \
                c_m in factive_verbs or c3 in factive_verbs or c4 in factive_verbs or \
                token['lemma_low_C4'] in factive_verbs or c1l in factive_verbs or c2l in factive_verbs or \
                c_ml in factive_verbs or c3l in factive_verbs or c4l in factive_verbs:
                    token['factive_verbs_C4'] = 1
                else:
                    token['factive_verbs_C4'] = 0   

                # report verbs
                if token['text_low_C4'] in report_verbs or c1 in report_verbs or c2 in report_verbs or \
                c_m in report_verbs or c3 in report_verbs or c4 in report_verbs or \
                token['lemma_low_C4'] in report_verbs or c1l in report_verbs or c2l in report_verbs or \
                c_ml in report_verbs or c3l in report_verbs or c4l in report_verbs:
                    token['report_verbs_C4'] = 1
                else:
                    token['report_verbs_C4'] = 0

                # hedges
                if token['text_low_C4'] in hedges or c1 in hedges or c2 in hedges or \
                c_m in hedges or c3 in hedges or c4 in hedges or \
                token['lemma_low_C4'] in hedges or c1l in hedges or c2l in hedges or \
                c_ml in hedges or c3l in hedges or c4l in hedges:
                    token['hedges_C4'] = 1
                else:
                    token['hedges_C4'] = 0

                # boosters
                if token['text_low_C4'] in boosters or c1 in boosters or c2 in boosters or \
                c_m in boosters or c3 in boosters or c4 in boosters or \
                token['lemma_low_C4'] in boosters or c1l in boosters or c2l in boosters or \
                c_ml in boosters or c3l in boosters or c4l in boosters:
                    token['boosters_C4'] = 1
                else:
                    token['boosters_C4'] = 0
                    
                # implicative verbs
                if token['text_low_C4'] in implicative_verbs or c1 in implicative_verbs or c2 in implicative_verbs or \
                c_m in implicative_verbs or c3 in implicative_verbs or c4 in implicative_verbs or \
                token['lemma_low_C4'] in implicative_verbs or c1l in implicative_verbs or c2l in implicative_verbs or \
                c_ml in implicative_verbs or c3l in implicative_verbs or c4l in implicative_verbs:
                    token['implicative_verbs_C4'] = 1
                else:
                    token['implicative_verbs_C4'] = 0

            else:
                # assertive verbs
                if token['text_low_C4'] in assertive_verbs or token['lemma_low_C4'] in assertive_verbs:
                    token['assertive_verbs_C4'] = 1
                else:
                    token['assertive_verbs_C4'] = 0 

                # factive verbs
                if token['text_low_C4'] in factive_verbs or token['lemma_low_C4'] in factive_verbs:
                    token['factive_verbs_C4'] = 1
                else:
                    token['factive_verbs_C4'] = 0   

                # report verbs
                if token['text_low_C4'] in report_verbs or token['lemma_low_C4'] in report_verbs:
                    token['report_verbs_C4'] = 1
                else:
                    token['report_verbs_C4'] = 0

                # hedges
                if token['text_low_C4'] in hedges or token['lemma_low_C4'] in hedges:
                    token['hedges_C4'] = 1
                else:
                    token['hedges_C4'] = 0

                # boosters
                if token['text_low_C4'] in boosters or token['lemma_low_C4'] in boosters:
                    token['boosters_C4'] = 1
                else:
                    token['boosters_C4'] = 0
                    
                # implicative verbs
                if token['text_low_C4'] in implicative_verbs or token['lemma_low_C4'] in implicative_verbs:
                    token['implicative_verbs_C4'] = 1
                else:
                    token['implicative_verbs_C4'] = 0

[Output truncated: the last 5000 lines of the streaming output were cut off.]
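
The cell above repeats the same n-gram lookup for each of the six lexicons. A minimal sketch of how that lookup could be factored into one helper is given below. The function names `ngram_candidates` and `in_lexicon` and the toy lexicon are hypothetical and not part of the original notebook; the sketch also assumes that the window entry at the centre position equals the context token itself.

```python
def ngram_candidates(window, center):
    """Return the uni-, bi- and trigrams that contain window[center].

    `window` is a list of lowercase strings, with None used as padding.
    The spans mirror the word itself, c1, c2, c_m, c3 and c4 from the cell above.
    """
    spans = [
        (center, center),          # the word itself
        (center - 2, center),      # c1: two words before + word
        (center - 1, center),      # c2: one word before + word
        (center - 1, center + 1),  # c_m: word in the middle
        (center, center + 1),      # c3: word + one word after
        (center, center + 2),      # c4: word + two words after
    ]
    candidates = []
    for start, end in spans:
        if start < 0 or end >= len(window):
            continue  # span falls outside the window
        words = window[start:end + 1]
        if all(w is not None for w in words):
            candidates.append(' '.join(words))
    return candidates


def in_lexicon(lexicon, window_text, window_lemma, center):
    """Return 1 if any text or lemma n-gram around `center` is in `lexicon`, else 0."""
    candidates = ngram_candidates(window_text, center) + ngram_candidates(window_lemma, center)
    return int(any(c in lexicon for c in candidates))


# Toy data (hypothetical): a 9-slot window with the token of interest at index 6.
window_text = [None, None, None, None, 'he', 'pointed', 'out', 'that', None]
window_lemma = [None, None, None, None, 'he', 'point', 'out', 'that', None]
toy_lexicon = {'point out'}  # illustrative entry only

flag = in_lexicon(toy_lexicon, window_text, window_lemma, center=6)
```

With such a helper, each feature assignment would reduce to one line per lexicon, for example `token['hedges_C4'] = in_lexicon(hedges, window_text, window_lemma, 6)`.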

## 3 Create a feature vector for each word

### 3.1 Ungroup sentences (one observation = one word)

In [17]:
dt

Unnamed: 0,sentence,outlet,topic,type,biased_words2,spacy_lg,spacy_lg_dict,article
0,"""Orange Is the New Black"" star Yael Stone is r...",Fox News,environment,right,[],"("", Orange, Is, the, New, Black, "", star, Yael...","[{'text': 'Orange', 'text_low': 'orange', 'pos...","""Orange Is the New Black"" star Yael Stone is r..."
1,"""We have one beautiful law,"" Trump recently sa...",Alternet,gun control,left,"[bizarre, characteristically]","("", We, have, one, beautiful, law, ,, "", Trump...","[{'text': 'We', 'text_low': 'we', 'pos': 'PRON...","""We have one beautiful law,"" Trump recently sa..."
2,"...immigrants as criminals and eugenics, all o...",MSNBC,white-nationalism,left,"[criminals, fringe, extreme]","(..., immigrants, as, criminals, and, eugenics...","[{'text': 'immigrants', 'text_low': 'immigrant...","...immigrants as criminals and eugenics, all o..."
3,...we sounded the alarm in the early months of...,Alternet,white-nationalism,left,[],"(..., we, sounded, the, alarm, in, the, early,...","[{'text': 'we', 'text_low': 'we', 'pos': 'PRON...",...we sounded the alarm in the early months of...
4,[Black Lives Matter] is essentially a non-fals...,Breitbart,marriage-equality,,[cult],"([, Black, Lives, Matter, ], is, essentially, ...","[{'text': 'Black', 'text_low': 'black', 'pos':...",[Black Lives Matter] is essentially a non-fals...
...,...,...,...,...,...,...,...,...
3669,You’ve heard of Jim Crow and Southern Segregat...,Breitbart,marriage-equality,,[ALL],"(You, ’ve, heard, of, Jim, Crow, and, Southern...","[{'text': 'You', 'text_low': 'you', 'pos': 'PR...",You’ve heard of Jim Crow and Southern Segregat...
3670,Young female athletes’ dreams and accomplishme...,Breitbart,marriage-equality,,"[dashed, ""identify""]","(Young, female, athletes, ’, dreams, and, acco...","[{'text': 'Young', 'text_low': 'young', 'pos':...",Young female athletes’ dreams and accomplishme...
3671,"Young white men, reacting to social and educat...",Federalist,white-nationalism,right,"[evil, white]","(Young, white, men, ,, reacting, to, social, a...","[{'text': 'Young', 'text_low': 'young', 'pos':...","Young white men, reacting to social and educat..."
3672,Young women taking part in high school and col...,Breitbart,sport,right,"[dashed, ""identify""]","(Young, women, taking, part, in, high, school,...","[{'text': 'Young', 'text_low': 'young', 'pos':...",Young women taking part in high school and col...


In [18]:
for feat in dt.loc[1,'spacy_lg_dict']:
  print(feat)

{'text': 'We', 'text_low': 'we', 'pos': 'PRON', 'lemma': '-PRON-', 'lemma_low': '-pron-', 'tag': 'PRP', 'dep': 'nsubj', 'shape': 'Xx', 'is_alpha': True, 'is_stop': True, 'has_vec': True, 'glove_vec300': array([-0.31676668,  4.087597  , -0.1161238 , -2.0217166 , -0.45238057,
        0.35491768,  1.4899944 ,  3.952473  ,  3.8075526 ,  1.7800549 ,
       -2.2092876 ,  3.6313457 ,  0.17599165,  1.5715055 ,  1.0988023 ,
       -1.1550162 , -2.6193419 ,  0.01218152, -2.9498038 ,  0.79302603,
        0.9338388 , -3.291411  , -1.6779103 , -3.2861583 , -3.2453792 ,
       -1.7613271 , -0.96935546, -0.98925966, -1.137018  ,  0.32698405,
        2.601997  , -3.179267  , -2.429421  , -0.9683772 ,  6.714281  ,
        0.01271152, -2.3952408 ,  4.1934476 , -0.15137061, -2.463461  ,
        1.3733284 , -0.6784345 , -0.99732786, -0.490483  , -0.7643031 ,
        2.7650192 ,  0.18923801, -1.2285393 , -2.5842826 ,  1.0963786 ,
       -1.2242095 ,  2.2945304 , -2.0236936 ,  1.5776336 ,  0.6756185 ,
     

In [19]:
dt.columns

Index(['sentence', 'outlet', 'topic', 'type', 'biased_words2', 'spacy_lg',
       'spacy_lg_dict', 'article'],
      dtype='object')

In [20]:
#dt = dt[['sentence','outlet','topic','type','num_sent','article','biased_words','spacy_lg','spacy_lg_dict']]

# One output row per token: repeat the sentence-level columns for every token dict.
sentence_cols = ['sentence', 'outlet', 'topic', 'type', 'biased_words2', 'spacy_lg', 'article']
rows = [[row[c] for c in sentence_cols] + [t]
        for _, row in dt.iterrows()
        for t in row['spacy_lg_dict']]
dt_ungrouped = pd.DataFrame(rows, columns=sentence_cols + ['spacy_lg_dict'])

print('The length of the initial dataset with sentences is:', len(dt),
      ', the length of the expanded dataset with words is:', len(dt_ungrouped))

The length of the initial dataset with sentences is: 3673 , the length of the expanded dataset with words is: 110269
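
As an alternative to collecting rows by hand, pandas offers `DataFrame.explode`, which turns each element of a list-valued column into its own row. The sketch below uses a small toy frame (hypothetical data) with the same column layout idea; it is not the notebook's code. One difference to keep in mind: `explode` keeps a row with a missing value for an empty list, whereas a loop over the tokens produces no row at all.

```python
import pandas as pd

# Toy sentence-level frame (hypothetical data): one list of token dicts per sentence.
toy = pd.DataFrame({
    'sentence': ['A b.', 'C.'],
    'outlet': ['X', 'Y'],
    'spacy_lg_dict': [[{'text': 'A'}, {'text': 'b'}], [{'text': 'C'}]],
})

# One row per token; sentence-level columns are repeated automatically.
toy_ungrouped = toy.explode('spacy_lg_dict').reset_index(drop=True)

print(len(toy), len(toy_ungrouped))
```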


In [21]:
dt.head(2)

Unnamed: 0,sentence,outlet,topic,type,biased_words2,spacy_lg,spacy_lg_dict,article
0,"""Orange Is the New Black"" star Yael Stone is r...",Fox News,environment,right,[],"("", Orange, Is, the, New, Black, "", star, Yael...","[{'text': 'Orange', 'text_low': 'orange', 'pos...","""Orange Is the New Black"" star Yael Stone is r..."
1,"""We have one beautiful law,"" Trump recently sa...",Alternet,gun control,left,"[bizarre, characteristically]","("", We, have, one, beautiful, law, ,, "", Trump...","[{'text': 'We', 'text_low': 'we', 'pos': 'PRON...","""We have one beautiful law,"" Trump recently sa..."


In [22]:
dt_ungrouped.head(2)

Unnamed: 0,sentence,outlet,topic,type,biased_words2,spacy_lg,article,spacy_lg_dict
0,"""Orange Is the New Black"" star Yael Stone is r...",Fox News,environment,right,[],"("", Orange, Is, the, New, Black, "", star, Yael...","""Orange Is the New Black"" star Yael Stone is r...","{'text': 'Orange', 'text_low': 'orange', 'pos'..."
1,"""Orange Is the New Black"" star Yael Stone is r...",Fox News,environment,right,[],"("", Orange, Is, the, New, Black, "", star, Yael...","""Orange Is the New Black"" star Yael Stone is r...","{'text': 'Is', 'text_low': 'is', 'pos': 'AUX',..."


### 3.2 Create a column for each feature

In [23]:
features = list(dt_ungrouped['spacy_lg_dict'][0].keys())

for feat in features:
    dt_ungrouped[feat] = dt_ungrouped["spacy_lg_dict"].apply(lambda x: x[feat])

dt_ungrouped.head(2)

Unnamed: 0,sentence,outlet,topic,type,biased_words2,spacy_lg,article,spacy_lg_dict,text,text_low,pos,lemma,lemma_low,tag,dep,shape,is_alpha,is_stop,has_vec,glove_vec300,glove_vec300_norm,is_oov,order,tfidf_art,label3,label4,label5,is_ne,ne_label,liwc2015,negative_conc,positive_conc,weak_subj,strong_subj,MRCP_concretness_ratings,MRCP_Imagability_ratings,hyperbolic_terms,attitude_markers,kill_verbs,bias_lexicon,...,weak_subj_C4,strong_subj_C4,MRCP_concretness_ratings_C4,MRCP_Imagability_ratings_C4,hyperbolic_terms_C4,attitude_markers_C4,kill_verbs_C4,bias_lexicon_C4,window_text,window_lemma,assertive_verbs,factive_verbs,report_verbs,hedges,boosters,implicative_verbs,assertive_verbs_C1,factive_verbs_C1,report_verbs_C1,hedges_C1,boosters_C1,implicative_verbs_C1,assertive_verbs_C2,factive_verbs_C2,report_verbs_C2,hedges_C2,boosters_C2,implicative_verbs_C2,assertive_verbs_C3,factive_verbs_C3,report_verbs_C3,hedges_C3,boosters_C3,implicative_verbs_C3,assertive_verbs_C4,factive_verbs_C4,report_verbs_C4,hedges_C4,boosters_C4,implicative_verbs_C4
0,"""Orange Is the New Black"" star Yael Stone is r...",Fox News,environment,right,[],"("", Orange, Is, the, New, Black, "", star, Yael...","""Orange Is the New Black"" star Yael Stone is r...","{'text': 'Orange', 'text_low': 'orange', 'pos'...",Orange,orange,PROPN,Orange,orange,NNP,nsubj,Xxxxx,True,False,True,"[0.9383973, -1.9984279, -0.5027343, 1.1871433,...",22.000772,True,0,0.278468,0,0,0,True,WORK_OF_ART,"[21, 60, 61]",0,0,0,0,601.0,626.0,0,0,0,0,...,0.0,0.0,237.0,209.0,0.0,0.0,0.0,0.0,"[None, None, None, None, orange, is, the, new,...","[None, None, None, None, orange, be, the, new,...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,"""Orange Is the New Black"" star Yael Stone is r...",Fox News,environment,right,[],"("", Orange, Is, the, New, Black, "", star, Yael...","""Orange Is the New Black"" star Yael Stone is r...","{'text': 'Is', 'text_low': 'is', 'pos': 'AUX',...",Is,is,AUX,be,be,VBZ,ROOT,Xx,True,True,True,"[0.76133776, 1.159646, -3.5622559, 1.8416485, ...",23.621582,True,1,0.179649,0,0,0,True,WORK_OF_ART,"[1, 12, 20, 91]",0,0,0,0,223.0,230.0,0,0,0,0,...,0.0,0.0,348.0,418.0,0.0,0.0,0.0,0.0,"[None, None, None, orange, is, the, new, black...","[None, None, None, orange, be, the, new, black...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
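
The loop above adds one column per dictionary key, taking the key set from the first token. A minimal sketch of the same expansion in a single step is shown below, on toy data (hypothetical values). Building a frame from the list of dicts is one possible alternative, not the notebook's method; keys missing from some dicts would become NaN here instead of raising a `KeyError`.

```python
import pandas as pd

# Toy word-level frame (hypothetical data): one feature dict per token.
toy = pd.DataFrame({
    'sentence': ['s1', 's1'],
    'spacy_lg_dict': [{'text': 'Orange', 'pos': 'PROPN'},
                      {'text': 'Is', 'pos': 'AUX'}],
})

# Turn the list of dicts into columns, keeping the original row index.
expanded = pd.DataFrame(toy['spacy_lg_dict'].tolist(), index=toy.index)
toy_wide = pd.concat([toy, expanded], axis=1)
```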


### 3.3 Expand categorical features (one-hot encoding)

Categorical features:
- liwc2015
- pos
- dep
- ne_label

In [24]:
# replace missing LIWC category lists (context positions outside the sentence) with empty lists
dt_ungrouped['liwc2015_C1'] = [[] if x is None else x for x in dt_ungrouped['liwc2015_C1']]
dt_ungrouped['liwc2015_C2'] = [[] if x is None else x for x in dt_ungrouped['liwc2015_C2']]
dt_ungrouped['liwc2015_C3'] = [[] if x is None else x for x in dt_ungrouped['liwc2015_C3']]
dt_ungrouped['liwc2015_C4'] = [[] if x is None else x for x in dt_ungrouped['liwc2015_C4']]

# expand liwc to binary features
for i, row in liwc2015_codes.iterrows():
    print(row.code, row.short_description)
    # for token
    dt_ungrouped[row.short_description] = dt_ungrouped.liwc2015.apply(lambda x: 1 if row.code in x else 0)
    # for c1
    c1 = row.short_description + '_C1'
    dt_ungrouped[c1] = dt_ungrouped.liwc2015_C1.apply(lambda x: 1 if row.code in x else 0)
    # for c2
    c2 = row.short_description + '_C2'
    dt_ungrouped[c2] = dt_ungrouped.liwc2015_C2.apply(lambda x: 1 if row.code in x else 0)
    # for c3
    c3 = row.short_description + '_C3'
    dt_ungrouped[c3] = dt_ungrouped.liwc2015_C3.apply(lambda x: 1 if row.code in x else 0)
    # for c4
    c4 = row.short_description + '_C4'
    dt_ungrouped[c4] = dt_ungrouped.liwc2015_C4.apply(lambda x: 1 if row.code in x else 0)

30 affect 
31 posemo 
32 negemo 
33 anx 
34 anger 
35 sad 
40 social 
41 family 
42 friend 
43 female 
44 male 
50 cogproc 
51 insight 
52 cause 
53 discrep 
54 tentat 
55 certain 
56 differ 
60 percept 
61 see 
62 hear 
63 feel 
70 bio 
71 body 
72 health 
73 sexual 
74 ingest 
80 drives 
81 affiliation 
82 achieve 
83 power 
84 reward 
85 risk 
90 focuspast 
91 focuspresent 
92 focusfuture 
100 relativ 
101 motion 
102 space 
103 time 
110 work 
111 leisure 
112 home 
113 money 
114 relig 
115 death 
120 informal 
121 swear 
122 netspeak 
123 assent 
124 nonflu 
125 filler 


In [25]:
pos_token = pd.get_dummies(dt_ungrouped.pos, prefix='pos')
pos_c1 = pd.get_dummies(dt_ungrouped.pos_C1, prefix='pos')
pos_c2 = pd.get_dummies(dt_ungrouped.pos_C2, prefix='pos')
pos_c3 = pd.get_dummies(dt_ungrouped.pos_C3, prefix='pos')
pos_c4 = pd.get_dummies(dt_ungrouped.pos_C4, prefix='pos')

pos_c1 = pos_c1.add_suffix('_C1')
pos_c2 = pos_c2.add_suffix('_C2')
pos_c3 = pos_c3.add_suffix('_C3')
pos_c4 = pos_c4.add_suffix('_C4')

dt_ungrouped = pd.merge(dt_ungrouped, pos_token, left_index=True, right_index=True, how='left')
dt_ungrouped = pd.merge(dt_ungrouped, pos_c1, left_index=True, right_index=True, how='left')
dt_ungrouped = pd.merge(dt_ungrouped, pos_c2, left_index=True, right_index=True, how='left')
dt_ungrouped = pd.merge(dt_ungrouped, pos_c3, left_index=True, right_index=True, how='left')
dt_ungrouped = pd.merge(dt_ungrouped, pos_c4, left_index=True, right_index=True, how='left')

In [26]:
dep_token = pd.get_dummies(dt_ungrouped.dep, prefix='dep')
dep_c1 = pd.get_dummies(dt_ungrouped.dep_C1, prefix='dep')
dep_c2 = pd.get_dummies(dt_ungrouped.dep_C2, prefix='dep')
dep_c3 = pd.get_dummies(dt_ungrouped.dep_C3, prefix='dep')
dep_c4 = pd.get_dummies(dt_ungrouped.dep_C4, prefix='dep')
dep_c1 = dep_c1.add_suffix('_C1')
dep_c2 = dep_c2.add_suffix('_C2')
dep_c3 = dep_c3.add_suffix('_C3')
dep_c4 = dep_c4.add_suffix('_C4')

dt_ungrouped = pd.merge(dt_ungrouped, dep_token, left_index=True, right_index=True, how='left')
dt_ungrouped = pd.merge(dt_ungrouped, dep_c1, left_index=True, right_index=True, how='left')
dt_ungrouped = pd.merge(dt_ungrouped, dep_c2, left_index=True, right_index=True, how='left')
dt_ungrouped = pd.merge(dt_ungrouped, dep_c3, left_index=True, right_index=True, how='left')
dt_ungrouped = pd.merge(dt_ungrouped, dep_c4, left_index=True, right_index=True, how='left')

In [27]:
ne_token = pd.get_dummies(dt_ungrouped.ne_label, prefix='ne')
ne_c1 = pd.get_dummies(dt_ungrouped.ne_label_C1, prefix='ne')
ne_c2 = pd.get_dummies(dt_ungrouped.ne_label_C2, prefix='ne')
ne_c3 = pd.get_dummies(dt_ungrouped.ne_label_C3, prefix='ne')
ne_c4 = pd.get_dummies(dt_ungrouped.ne_label_C4, prefix='ne')
ne_c1 = ne_c1.add_suffix('_C1')
ne_c2 = ne_c2.add_suffix('_C2')
ne_c3 = ne_c3.add_suffix('_C3')
ne_c4 = ne_c4.add_suffix('_C4')

dt_ungrouped = pd.merge(dt_ungrouped, ne_token, left_index=True, right_index=True, how='left')
dt_ungrouped = pd.merge(dt_ungrouped, ne_c1, left_index=True, right_index=True, how='left')
dt_ungrouped = pd.merge(dt_ungrouped, ne_c2, left_index=True, right_index=True, how='left')
dt_ungrouped = pd.merge(dt_ungrouped, ne_c3, left_index=True, right_index=True, how='left')
dt_ungrouped = pd.merge(dt_ungrouped, ne_c4, left_index=True, right_index=True, how='left')
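
`pd.get_dummies` only creates columns for values that actually occur in the input, so a label that never appears at a given context position yields no column for that position. The sketch below, on toy data (hypothetical values), shows one way to force the same label set on a context block with `reindex`. This is an optional alternative, not what the notebook does.

```python
import pandas as pd

# Toy frame (hypothetical): 'LANGUAGE' occurs for the token but never at position C1.
toy = pd.DataFrame({
    'ne_label':    ['ORG', 'LANGUAGE', None],
    'ne_label_C1': [None, 'ORG', 'ORG'],
})

ne_token = pd.get_dummies(toy['ne_label'], prefix='ne', dtype=int)
ne_c1 = pd.get_dummies(toy['ne_label_C1'], prefix='ne', dtype=int).add_suffix('_C1')

# Align the context block to the token block; absent labels become all-zero columns.
expected = [c + '_C1' for c in ne_token.columns]
ne_c1 = ne_c1.reindex(columns=expected, fill_value=0)

toy = pd.concat([toy, ne_token, ne_c1], axis=1)
```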

### 3.4 Combine context features into a collective feature describing 4 words around the word of interest

In [28]:
binary_features_for_training = [
 'negative_conc',
 'positive_conc',
 'weak_subj',
 'strong_subj',
 'hyperbolic_terms',
 'attitude_markers',
 'kill_verbs',
 'bias_lexicon',
 'assertive_verbs',
 'factive_verbs',
 'report_verbs',
 'implicative_verbs',
 'hedges',
 'boosters',
 'affect ',
 'posemo ',
 'negemo ',
 'anx ',
 'anger ',
 'sad ',
 'social ',
 'family ',
 'friend ',
 'female ',
 'male ',
 'cogproc ',
 'insight ',
 'cause ',
 'discrep ',
 'tentat ',
 'certain ',
 'differ ',
 'percept ',
 'see ',
 'hear ',
 'feel ',
 'bio ',
 'body ',
 'health ',
 'sexual ',
 'ingest ',
 'drives ',
 'affiliation ',
 'achieve ',
 'power ',
 'reward ',
 'risk ',
 'focuspast ',
 'focuspresent ',
 'focusfuture ',
 'relativ ',
 'motion ',
 'space ',
 'time ',
 'work ',
 'leisure ',
 'home ',
 'money ',
 'relig ',
 'death ',
 'informal ',
 'swear ',
 'netspeak ',
 'assent ',
 'nonflu ',
 'filler ',
 'pos_ADJ',
 'pos_ADP',
 'pos_ADV',
 'pos_AUX',
 'pos_DET',
 'pos_INTJ',
 'pos_NOUN',
 'pos_PRON',
 'pos_PROPN',
 'pos_SCONJ',
 'pos_VERB',
 'pos_X',
 'dep_ROOT',
 'dep_acl',
 'dep_acomp',
 'dep_advcl',
 'dep_advmod',
 'dep_agent',
 'dep_amod',
 'dep_appos',
 'dep_attr',
 'dep_aux',
 'dep_auxpass',
 #'dep_case',
 'dep_cc',
 'dep_ccomp',
 'dep_compound',
 'dep_conj',
 'dep_csubj',
 'dep_dative',
 'dep_dep',
 'dep_det',
 'dep_dobj',
 'dep_expl',
 'dep_intj',
 'dep_mark',
 #'dep_meta',
 'dep_neg',
 'dep_nmod',
 'dep_npadvmod',
 'dep_nsubj',
 'dep_nsubjpass',
 'dep_nummod',
 'dep_oprd',
 'dep_parataxis',
 'dep_pcomp',
 'dep_pobj',
 'dep_poss',
 'dep_preconj',
 'dep_predet',
 'dep_prep',
 'dep_prt',
 'dep_punct',
 'dep_quantmod',
 'dep_relcl',
 'dep_xcomp',
 'ne_CARDINAL',
 'ne_DATE',
 'ne_EVENT',
 'ne_FAC',
 'ne_GPE',
 'ne_LAW',
 'ne_LOC',
 'ne_MONEY',
 'ne_NORP',
 'ne_ORDINAL',
 'ne_ORG',
 'ne_PERCENT',
 'ne_PERSON',
 'ne_PRODUCT',
 'ne_QUANTITY',
 'ne_TIME',
 'ne_WORK_OF_ART']

# 'ne_LANGUAGE' is handled separately: the columns ne_LANGUAGE_C1 and ne_LANGUAGE_C2 do not exist
# (presumably because the LANGUAGE label never occurs at those context positions)

In [29]:
for feat in binary_features_for_training:
  if feat not in dt_ungrouped.columns:
    print(feat)

In [30]:
for feat in binary_features_for_training:
    new_feat = feat + '_context'
    f1, f2, f3, f4 = feat + '_C1', feat + '_C2', feat + '_C3', feat + '_C4'
    dt_ungrouped[new_feat] = dt_ungrouped.apply(lambda row: 1 if 1 in [row[f1],row[f2],row[f3],row[f4]] else 0,
                                                axis=1)

# 'ne_LANGUAGE' is handled separately: only the C3 and C4 columns exist for it
new_feat = 'ne_LANGUAGE' + '_context'
f3, f4 = 'ne_LANGUAGE' + '_C3', 'ne_LANGUAGE' + '_C4'
dt_ungrouped[new_feat] = dt_ungrouped.apply(lambda row: 1 if 1 in [row[f3],row[f4]] else 0,
                                            axis=1)
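
The context feature is 1 if the feature is 1 at any of the four context positions. The row-wise `apply` above can be slow on a frame of this size; a column-wise formulation is sketched below on toy data (hypothetical values). It is offered as an assumption-laden alternative and checked against the row-wise version only on this toy frame.

```python
import pandas as pd

# Toy frame (hypothetical): one binary feature at the four context positions.
toy = pd.DataFrame({
    'hedges_C1': [0, 1, 0], 'hedges_C2': [0, 0, 0],
    'hedges_C3': [0, 0, 0], 'hedges_C4': [0, 1, 0],
})

context_cols = ['hedges_C' + str(k) for k in range(1, 5)]

# 1 if any context position equals 1, else 0 (missing values count as "not 1").
toy['hedges_context'] = toy[context_cols].eq(1).any(axis=1).astype(int)
```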

### 3.5 Final dataset

In [31]:
dt_clean = dt_ungrouped[['sentence', # not a feature for training
 'outlet', # not a feature for training
 'topic', # not a feature for training
 'type', # not a feature for training
 #'num_sent', # not a feature for training
 'article', # not a feature for training
 'biased_words2',
 #'biased_words3', # not a feature for training
 #'biased_words4', # not a feature for training
 #'biased_words5', # not a feature for training
 'text', # not a feature for training
 'text_low', # not a feature for training
 'pos', # not a feature for training
 'lemma', # not a feature for training
 'lemma_low', # not a feature for training
 'tag', # not a feature for training
 'dep', # not a feature for training
 'is_stop', # not a feature for training
 'glove_vec300_norm',
 'order', # not a feature for training
 'tfidf_art',
 'label3', # not a feature for training
 'label4', # not a feature for training
 'label5', # not a feature for training
 'is_ne', # not a feature for training
 'ne_label', # not a feature for training
 'negative_conc',
 'positive_conc',
 'weak_subj',
 'strong_subj',
 'MRCP_concretness_ratings',
 'MRCP_Imagability_ratings',
 'hyperbolic_terms',
 'attitude_markers',
 'kill_verbs',
 'bias_lexicon',
 'assertive_verbs',
 'factive_verbs',
 'report_verbs',
 'implicative_verbs',
 'hedges',
 'boosters',
 'affect ',
 'posemo ',
 'negemo ',
 'anx ',
 'anger ',
 'sad ',
 'social ',
 'family ',
 'friend ',
 'female ',
 'male ',
 'cogproc ',
 'insight ',
 'cause ',
 'discrep ',
 'tentat ',
 'certain ',
 'differ ',
 'percept ',
 'see ',
 'hear ',
 'feel ',
 'bio ',
 'body ',
 'health ',
 'sexual ',
 'ingest ',
 'drives ',
 'affiliation ',
 'achieve ',
 'power ',
 'reward ',
 'risk ',
 'focuspast ',
 'focuspresent ',
 'focusfuture ',
 'relativ ',
 'motion ',
 'space ',
 'time ',
 'work ',
 'leisure ',
 'home ',
 'money ',
 'relig ',
 'death ',
 'informal ',
 'swear ',
 'netspeak ',
 'assent ',
 'nonflu ',
 'filler ',
 'pos_ADJ',
 'pos_ADP',
 'pos_ADV',
 'pos_AUX',
 'pos_DET',
 'pos_INTJ',
 'pos_NOUN',
 'pos_PRON',
 'pos_PROPN',
 'pos_SCONJ',
 'pos_VERB',
 'pos_X',
 'dep_ROOT',
 'dep_acl',
 'dep_acomp',
 'dep_advcl',
 'dep_advmod',
 'dep_agent',
 'dep_amod',
 'dep_appos',
 'dep_attr',
 'dep_aux',
 'dep_auxpass',
 #'dep_case',
 'dep_cc',
 'dep_ccomp',
 'dep_compound',
 'dep_conj',
 'dep_csubj',
 'dep_dative',
 'dep_dep',
 'dep_det',
 'dep_dobj',
 'dep_expl',
 'dep_intj',
 'dep_mark',
 #'dep_meta',
 'dep_neg',
 'dep_nmod',
 'dep_npadvmod',
 'dep_nsubj',
 'dep_nsubjpass',
 'dep_nummod',
 'dep_oprd',
 'dep_parataxis',
 'dep_pcomp',
 'dep_pobj',
 'dep_poss',
 'dep_preconj',
 'dep_predet',
 'dep_prep',
 'dep_prt',
 'dep_punct',
 'dep_quantmod',
 'dep_relcl',
 'dep_xcomp',
 'ne_CARDINAL',
 'ne_DATE',
 'ne_EVENT',
 'ne_FAC',
 'ne_GPE',
 'ne_LANGUAGE',
 'ne_LAW',
 'ne_LOC',
 'ne_MONEY',
 'ne_NORP',
 'ne_ORDINAL',
 'ne_ORG',
 'ne_PERCENT',
 'ne_PERSON',
 'ne_PRODUCT',
 'ne_QUANTITY',
 'ne_TIME',
 'ne_WORK_OF_ART',
 'negative_conc_context',
 'positive_conc_context',
 'weak_subj_context',
 'strong_subj_context',
 'hyperbolic_terms_context',
 'attitude_markers_context',
 'kill_verbs_context',
 'bias_lexicon_context',
 'assertive_verbs_context',
 'factive_verbs_context',
 'report_verbs_context',
 'implicative_verbs_context',
 'hedges_context',
 'boosters_context',
 'affect _context',
 'posemo _context',
 'negemo _context',
 'anx _context',
 'anger _context',
 'sad _context',
 'social _context',
 'family _context',
 'friend _context',
 'female _context',
 'male _context',
 'cogproc _context',
 'insight _context',
 'cause _context',
 'discrep _context',
 'tentat _context',
 'certain _context',
 'differ _context',
 'percept _context',
 'see _context',
 'hear _context',
 'feel _context',
 'bio _context',
 'body _context',
 'health _context',
 'sexual _context',
 'ingest _context',
 'drives _context',
 'affiliation _context',
 'achieve _context',
 'power _context',
 'reward _context',
 'risk _context',
 'focuspast _context',
 'focuspresent _context',
 'focusfuture _context',
 'relativ _context',
 'motion _context',
 'space _context',
 'time _context',
 'work _context',
 'leisure _context',
 'home _context',
 'money _context',
 'relig _context',
 'death _context',
 'informal _context',
 'swear _context',
 'netspeak _context',
 'assent _context',
 'nonflu _context',
 'filler _context',
 'pos_ADJ_context',
 'pos_ADP_context',
 'pos_ADV_context',
 'pos_AUX_context',
 'pos_DET_context',
 'pos_INTJ_context',
 'pos_NOUN_context',
 'pos_PRON_context',
 'pos_PROPN_context',
 'pos_SCONJ_context',
 'pos_VERB_context',
 'pos_X_context',
 'dep_ROOT_context',
 'dep_acl_context',
 'dep_acomp_context',
 'dep_advcl_context',
 'dep_advmod_context',
 'dep_agent_context',
 'dep_amod_context',
 'dep_appos_context',
 'dep_attr_context',
 'dep_aux_context',
 'dep_auxpass_context',
 #'dep_case_context',
 'dep_cc_context',
 'dep_ccomp_context',
 'dep_compound_context',
 'dep_conj_context',
 'dep_csubj_context',
 'dep_dative_context',
 'dep_dep_context',
 'dep_det_context',
 'dep_dobj_context',
 'dep_expl_context',
 'dep_intj_context',
 'dep_mark_context',
 #'dep_meta_context',
 'dep_neg_context',
 'dep_nmod_context',
 'dep_npadvmod_context',
 'dep_nsubj_context',
 'dep_nsubjpass_context',
 'dep_nummod_context',
 'dep_oprd_context',
 'dep_parataxis_context',
 'dep_pcomp_context',
 'dep_pobj_context',
 'dep_poss_context',
 'dep_preconj_context',
 'dep_predet_context',
 'dep_prep_context',
 'dep_prt_context',
 'dep_punct_context',
 'dep_quantmod_context',
 'dep_relcl_context',
 'dep_xcomp_context',
 'ne_CARDINAL_context',
 'ne_DATE_context',
 'ne_EVENT_context',
 'ne_FAC_context',
 'ne_GPE_context',
 'ne_LAW_context',
 'ne_LOC_context',
 'ne_MONEY_context',
 'ne_NORP_context',
 'ne_ORDINAL_context',
 'ne_ORG_context',
 'ne_PERCENT_context',
 'ne_PERSON_context',
 'ne_PRODUCT_context',
 'ne_QUANTITY_context',
 'ne_TIME_context',
 'ne_WORK_OF_ART_context',
 'ne_LANGUAGE_context']]

In [32]:
print('number of observations:', len(dt_clean))
print('number of unique words:', len(set(dt_clean['text_low'])))
print('number of features:', len(list(dt_clean))-23)
print('number of biased words (t=3):', len(dt_clean[dt_clean['label3']==1]))
print('number of biased words (t=4):', len(dt_clean[dt_clean['label4']==1]))
print('number of biased words (t=5):', len(dt_clean[dt_clean['label5']==1]))

number of observations: 110269
number of unique words: 12523
number of features: 277
number of biased words (t=3): 3736
number of biased words (t=4): 3736
number of biased words (t=5): 3736


In [33]:
dt_final = dt_clean[dt_clean['is_stop']==False]
print('number of observations:', len(dt_final))
print('number of unique words:', len(set(dt_final['text_low'])))
print('number of features:', len(list(dt_final))-23)
print('number of biased words (t=3):', len(dt_final[dt_final['label3']==1]))
print('number of biased words (t=4):', len(dt_final[dt_final['label4']==1]))
print('number of biased words (t=5):', len(dt_final[dt_final['label5']==1]))

number of observations: 65745
number of unique words: 12253
number of features: 277
number of biased words (t=3): 3416
number of biased words (t=4): 3416
number of biased words (t=5): 3416
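
The feature count above subtracts a hard-coded number of non-feature columns, which can drift when columns are added or commented out. A minimal sketch of deriving the count from an explicit list is given below. The list `meta_cols` and the toy frame are hypothetical; the actual set of non-feature columns would have to be taken from the comments in the column selection above.

```python
import pandas as pd

# Toy frame (hypothetical): a few metadata/label columns and two features.
toy = pd.DataFrame({
    'sentence': ['s', 's'], 'text_low': ['the', 'alarm'], 'is_stop': [True, False],
    'label3': [0, 1], 'tfidf_art': [0.1, 0.4], 'hedges': [0, 0],
})

meta_cols = ['sentence', 'text_low', 'is_stop', 'label3']  # hypothetical non-feature columns
feature_cols = [c for c in toy.columns if c not in meta_cols]

# Drop stop words, as in the cell above.
toy_final = toy[~toy['is_stop']]

print('number of features:', len(feature_cols))
print('number of biased words (t=3):', int((toy_final['label3'] == 1).sum()))
```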


In [None]:
#dt_clean.to_excel('dt_clean.xlsx')
#dt_final.to_excel('data/dt_final_SG1.xlsx')
dt_final.to_excel('data/dt_final_SG2.xlsx')