## 📒01 | RQ1 Sentiment Dictionary-Based Analysis Using the 6-Sentiment Lexicon (Dritsa, 2018)

**Preprocessing steps specifically tailored for this analysis:**\
✅ diacritics removal\
✅ formal phrases and honorifics removal\
✅ remove extra white space\
✅ tokenize speech\
✅ lemmatize speech\
❌ stop-word removal (stop-words are kept because the Dritsa (2018) six-sentiment lexicon includes them)
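
The diacritics and whitespace steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual preprocessing code: the helper names `strip_diacritics` and `collapse_whitespace` are hypothetical, and the real pipeline that produced `speech_clean` may differ.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining accent marks (e.g. the Greek tonos) from text."""
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace with a single space and trim both ends."""
    return " ".join(text.split())


cleaned = collapse_whitespace(strip_diacritics("  αγαπημένη   πατρίδα "))
print(cleaned)
```

Removing diacritics matters here because the lexicon terms shown later in this notebook are stored without accents, so accented tokens would otherwise fail the lookup.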

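In [1]:
from collections import defaultdict
import jellyfish

In [3]:
import numpy as np
import pandas as pd
import spacy
from tqdm.auto import tqdm

tqdm.pandas()  # registers Series.progress_apply, used below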

In [4]:
rq1_df = pd.read_csv('processed01_par10-20.csv',index_col=0)

In [5]:
rq1_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 341805 entries, 0 to 341804
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Unnamed: 0        341805 non-null  int64 
 1   member_name       341805 non-null  object
 2   sitting_date      341805 non-null  object
 3   political_party   341805 non-null  object
 4   government        341805 non-null  object
 5   roles             341805 non-null  object
 6   member_gender     341805 non-null  object
 7   speech            341805 non-null  object
 8   year              341805 non-null  int64 
 9   is_government     341805 non-null  int64 
 10  speaker_gov_role  60473 non-null   object
 11  leadership_role   10689 non-null   object
 12  speech_clean      338501 non-null  object
dtypes: int64(3), object(10)
memory usage: 36.5+ MB


Tokenization and lemmatization of Greek text using spaCy; whitespace tokens excluded.

In [59]:
nlp = spacy.load("el_core_news_sm")

def tokenize_and_lemmatize(text):
    doc = nlp(text)
    return [token.lemma_ for token in doc if not token.is_space]

Load the six-sentiment lexicon for the Greek language by Dritsa (2018).

In [76]:
# Load lexicon into dictionary
lexicon_df = pd.read_csv("out_lexicon_6sent.csv")

In [77]:
lexicon_df.head()

Unnamed: 0,term,anger,disgust,fear,happiness,sadness,surprise
0,αβαφτιστος,4.0,4.5,1.0,1.0,2.5,4.5
1,Χριστος,4.5,3.75,4.25,4.0,4.0,4.5
2,α,3.75,4.0,4.0,4.0,3.75,4.75
3,αβαπτιστος,4.0,4.5,1.0,1.0,2.5,4.5
4,αβεβαιοτητα,1.0,1.0,2.5,1.0,1.5,1.0


Dictionary-based sentiment lexicon constructed by mapping each term to six emotion intensity scores from the lexicon DataFrame.

In [78]:
lexicon = {
    row['term'].strip(): [
        float(row['anger']),
        float(row['disgust']),
        float(row['fear']),
        float(row['happiness']),
        float(row['sadness']),
        float(row['surprise'])
    ]
    for _, row in lexicon_df.iterrows()
}
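
As an alternative sketch, the same term-to-vector mapping can be built without `iterrows()`, reading the CSV directly with the standard library. The helper name `load_lexicon` and the `EMOTIONS` constant are assumptions for illustration; the two sample rows are copied from the `lexicon_df.head()` output above. As with the dictionary comprehension, a term that appears more than once keeps its last row.

```python
import csv
import io

# Two rows copied from the lexicon_df.head() output; the real file has many more.
SAMPLE_CSV = """term,anger,disgust,fear,happiness,sadness,surprise
αβαφτιστος,4.0,4.5,1.0,1.0,2.5,4.5
αβεβαιοτητα,1.0,1.0,2.5,1.0,1.5,1.0
"""

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]


def load_lexicon(handle):
    """Map each stripped term to its six emotion scores, in EMOTIONS order."""
    reader = csv.DictReader(handle)
    return {row["term"].strip(): [float(row[e]) for e in EMOTIONS] for row in reader}


sample_lexicon = load_lexicon(io.StringIO(SAMPLE_CSV))
print(sample_lexicon["αβεβαιοτητα"])
```

For the real file, `open("out_lexicon_6sent.csv", encoding="utf-8", newline="")` would replace the `io.StringIO` handle, assuming the file is UTF-8 encoded.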

Function that derives a 6-dimensional emotion vector for a speech: for each emotion, it takes the root mean square (RMS) of the scores of all lexicon-matched tokens. Speeches with no matched token get a zero vector.

In [29]:
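def sent6_vec(text, lexicon, tokenize_fn):
    """Return the per-emotion RMS of lexicon scores over the matched tokens of `text`."""
    word_vecs = []

    # Collect the 6-score vector of every token found in the lexicon
    for word in tokenize_fn(text):
        if word in lexicon:
            word_vecs.append(lexicon[word])

    # No lexicon match: return a zero vector
    if not word_vecs:
        return [0] * 6

    word_vecs = np.array(word_vecs)

    # RMS along axis 0, i.e. across tokens, separately for each emotion
    rms = np.sqrt(np.mean(np.square(word_vecs), axis=0))
    return [round(float(v), 3) for v in rms]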
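
To make the aggregation concrete, here is a small worked example in plain Python that mirrors the RMS step for two lexicon entries. The helper `rms_vector` is a hypothetical stand-in for the NumPy expression in `sent6_vec`; the two vectors are taken from outputs shown elsewhere in this notebook.

```python
import math


def rms_vector(vectors):
    """Per-dimension root mean square of equal-length score vectors."""
    n = len(vectors)
    dims = len(vectors[0])
    return [round(math.sqrt(sum(v[i] ** 2 for v in vectors) / n), 3) for i in range(dims)]


matched = [
    [1.0, 1.0, 1.0, 4.75, 1.0, 2.75],  # "αγαπημενη"
    [1.0, 1.0, 2.5, 1.0, 1.5, 1.0],    # "αβεβαιοτητα"
]

result = rms_vector(matched)
mean_happiness = (4.75 + 1.0) / 2  # arithmetic mean, for comparison
print(result, mean_happiness)
```

Because the scores are squared before averaging, RMS weights high scores more heavily than an arithmetic mean would: the happiness component here comes out above the plain average of 2.875.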

<b style="color:blue;">Test on a sample</b>

In [80]:
sample_df = rq1_df.sample(n=1000, random_state=32).copy()

In [79]:
print("αγαπημενη" in lexicon)
print(lexicon.get("αγαπημενη"))

True
[1.0, 1.0, 1.0, 4.75, 1.0, 2.75]
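
Beyond checking a single term, it can help to see how many tokens of a speech the lexicon actually covers, since a speech with no match receives a zero vector. The sketch below is an illustrative diagnostic, not part of the original analysis; `lexicon_coverage`, `toy_lexicon`, and `toy_tokens` are hypothetical names and data.

```python
def lexicon_coverage(tokens, lexicon):
    """Return (matched_count, total_count, share_matched) for a token list."""
    total = len(tokens)
    matched = sum(1 for tok in tokens if tok in lexicon)
    share = matched / total if total else 0.0
    return matched, total, share


# Toy data: one real lexicon entry and a hypothetical token list.
toy_lexicon = {"αγαπημενη": [1.0, 1.0, 1.0, 4.75, 1.0, 2.75]}
toy_tokens = ["η", "αγαπημενη", "πατριδα", "μας"]

coverage = lexicon_coverage(toy_tokens, toy_lexicon)
print(coverage)
```

In the notebook, the token list would come from `tokenize_and_lemmatize(text)` and the lexicon from the dictionary built above.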


In [44]:
sample_df['speech_clean'] = sample_df['speech_clean'].fillna('').astype(str)

In [82]:
sample_df['sent6_vec'] = sample_df['speech_clean'].progress_apply(
    lambda text: sent6_vec(text, lexicon, tokenize_and_lemmatize)
)


In [83]:
sample_df[['member_name', 'political_party', 'year', 'roles', 'is_government', 'sent6_vec']].head()


Unnamed: 0,member_name,political_party,year,roles,is_government,sent6_vec
318719,αναγνωστοπουλου πετρου αθανασια (σια),συνασπισμος ριζοσπαστικης αριστερας,2019,['αναπληρωτης υπουργος εξωτερικων(15/02/2019-0...,0,"[3.142, 2.915, 2.016, 1.581, 1.0, 3.142]"
126402,κωνσταντοπουλου ν. ζωη,συνασπισμος ριζοσπαστικης αριστερας,2014,['βουλευτης'],0,"[0, 0, 0, 0, 0, 0]"
321719,μελας παναγιωτη ιωαννης,νεα δημοκρατια,2019,['βουλευτης'],0,"[0, 0, 0, 0, 0, 0]"
257082,βαρδακης δημητριου σωκρατης,συνασπισμος ριζοσπαστικης αριστερας,2017,['βουλευτης'],1,"[2.63, 2.517, 1.744, 2.457, 1.0, 3.131]"
65452,χρυσοχοιδης βασιλειου μιχαηλ,πανελληνιο σοσιαλιστικο κινημα,2012,['υπουργος αναπτυξης ανταγωνιστικοτητας και να...,1,"[3.651, 3.391, 2.363, 1.969, 1.225, 3.582]"


In [84]:
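sentiment_cols = ['anger', 'disgust', 'fear', 'happiness', 'sadness', 'surprise']
# Reuse sample_df's index so that pd.concat(axis=1) pairs each vector with its own speech
sentiment_df = pd.DataFrame(sample_df['sent6_vec'].tolist(), columns=sentiment_cols, index=sample_df.index)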

In [85]:
sample_df_sent = pd.concat([sample_df, sentiment_df], axis=1)
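
One thing to watch in this step: `pd.concat(..., axis=1)` pairs rows by index label, not by position, so the frame of emotion scores needs the same index as the sampled frame. The toy example below illustrates the difference; the frame names `sampled`, `misaligned`, and `aligned` are hypothetical, and the index labels and scores are borrowed from the first two rows of the output above.

```python
import pandas as pd

# Toy frame with non-contiguous index labels, as DataFrame.sample() leaves them.
sampled = pd.DataFrame({"year": [2019, 2014]}, index=[318719, 126402])
scores = [[3.142, 1.581], [0.0, 0.0]]
cols = ["anger", "happiness"]

# Default RangeIndex (0, 1): labels do not match `sampled`, so rows are not paired
# and the result is padded with NaN.
misaligned = pd.concat([sampled, pd.DataFrame(scores, columns=cols)], axis=1)

# Reusing the sampled index pairs each score row with its own speech.
aligned = pd.concat(
    [sampled, pd.DataFrame(scores, columns=cols, index=sampled.index)],
    axis=1,
)

print(misaligned.shape, aligned.shape)
```

Calling `reset_index(drop=True)` on the metadata frame before concatenating, as done for the full dataset at the end of this section, is the other common way to make both indexes agree.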

In [86]:
sample_df_sent[sentiment_cols].describe()

Unnamed: 0,anger,disgust,fear,happiness,sadness,surprise
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,1.769585,1.588783,1.115906,1.067624,0.714237,1.915119
std,1.644246,1.50585,1.040254,0.995051,0.668034,1.72734
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0
50%,2.0,1.5,1.0,1.0,1.0,2.5
75%,3.261,3.026,2.109,1.969,1.145,3.5
max,4.33,4.0,4.25,4.5,3.5,4.33


6-dimensional emotion vectors computed for all cleaned speeches using the defined function and stored in a new column.

In [90]:
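# speech_clean has missing values (338501 non-null of 341805); treat them as empty text
rq1_df['sent6_vec'] = rq1_df['speech_clean'].fillna('').astype(str).progress_apply(
    lambda text: sent6_vec(text, lexicon, tokenize_and_lemmatize)
)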


Emotion vector results stored in a new DataFrame.

In [91]:
sentiment_cols = ['anger', 'disgust', 'fear', 'happiness', 'sadness', 'surprise']
sentiment_redf = pd.DataFrame(rq1_df['sent6_vec'].tolist(), columns=sentiment_cols)

In [92]:
meta_cols = ['member_name', 'political_party', 'year', 'is_government', 'roles']
meta_df = rq1_df[meta_cols].reset_index(drop=True)

In [93]:
rq1_results_df = pd.concat([meta_df, sentiment_redf], axis=1)