# Development of Emotion and Reasoning in the General Speeches of the United Nations: A text-based machine learning approach
## Script 3: Tables

### Instructions BEFORE running this script:
- Make sure you have run Scripts 0 to 2 completely beforehand, so that the folder structure and the required data exist

### Description: 
#### This file creates the following tables:

Descriptive Summary Statistics
- Summary Statistics
- Frequency of Categorical Variables

Emotionality Scoring
- Emotionality Scoring per Decade
- Emotionality Scoring - Subsamples
- Emotionality Scoring - Categorical Variables
- Emotionality Scoring - Position (1994-2024)

Other Tables
- T-Test Subsamples
- Years with more than 5 female speakers
- Number of (Unique) Tokens
- Speeches with the highest and lowest emotionality score
- Speeches of permanent members of the security council with the highest and lowest emotionality score

Appendix
- Yearly Emotionality Scores
- Year with a change in the emotionality score by over 0.08
- Two speeches with the lowest and highest score (Fully printed)

In [65]:
# == Import libraries for data processing and visualization ==
import joblib
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
from scipy import stats
from tabulate import tabulate
import re
from pathlib import Path
import ast

# === Set Working Directory ===

# --- Set base path to project root ---
base_path = Path.cwd().parent  # project root
print(f"Project root set to: {base_path}")

# === Define Folder Paths ===
data_c = base_path / "data"
data_results = os.path.join(data_c, 'results')
data_temp = os.path.join(data_c, 'temp')
data_freq = os.path.join(data_c, 'freq')
data_stopwords = os.path.join(data_c, 'stopwords')
data_dict = data_c / "dictionaries"
tables_dir = os.path.join(base_path, 'tables')

# === Load data ===

os.chdir(tables_dir)
un_corpus_scored = pd.read_csv(
    os.path.join(data_results, "un_corpus_scored.csv"),
    sep=';', 
    encoding='utf-8'
)

stopwords = joblib.load(os.path.join(data_stopwords, "stopwords.pkl"))
procedural_words = joblib.load(os.path.join(data_stopwords, "procedural_words.pkl"))
affect_dic = joblib.load(data_dict / 'dictionary_affect.pkl')
cognition_dic = joblib.load(data_dict / 'dictionary_cognition.pkl')

Project root set to: C:\Users\sarah\Downloads\TESTRUN
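
The setup cell mixes `os.path.join` and `pathlib` and assumes that every folder already exists. As a minimal sketch, assuming the folder layout shown above, the paths could be built in one place and checked before any file is loaded. The helper names `build_project_paths` and `missing_folders` are hypothetical and not part of the original script.

```python
from pathlib import Path

def build_project_paths(root):
    """Return the project folders used in this script as Path objects."""
    root = Path(root)
    data = root / "data"
    return {
        "data": data,
        "results": data / "results",
        "temp": data / "temp",
        "freq": data / "freq",
        "stopwords": data / "stopwords",
        "dictionaries": data / "dictionaries",
        "tables": root / "tables",
    }

def missing_folders(paths):
    """List the keys whose folder does not exist yet."""
    return [name for name, folder in paths.items() if not folder.is_dir()]

# Same convention as above: the notebook lives one level below the project root
paths = build_project_paths(Path.cwd().parent)
print("Missing folders:", missing_folders(paths))
```

An empty list means Scripts 0 to 2 have created everything this script needs.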


----

## Descriptive Summary Statistics

In [5]:
# Create separate dummies from the position variable for the summary table

position_nonmissing = un_corpus_scored['position'].notna()

position_dummies = pd.get_dummies(un_corpus_scored.loc[position_nonmissing, 'position'])

position_dummies = position_dummies.astype(int)

position_dummies = position_dummies.reindex(un_corpus_scored.index)

position_dummies.loc[~position_nonmissing, :] = pd.NA

position_dummies = position_dummies.astype("Int64")

un_corpus_scored = pd.concat([un_corpus_scored, position_dummies], axis=1)
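
The cell above builds position dummies only for speeches with a known position and leaves the remaining rows missing, so that unknown positions are not counted as zeros. A compact sketch of the same idea on toy data (the `toy` frame is invented for illustration):

```python
import pandas as pd

# Toy data: the second speech has no position information
toy = pd.DataFrame({"position": ["Others", None, "(Vice-) President", "Others"]})

known = toy["position"].notna()
dummies = (
    pd.get_dummies(toy.loc[known, "position"])
    .astype("Int64")      # nullable integer dtype, so <NA> stays possible
    .reindex(toy.index)   # rows with unknown position come back as <NA>
)
print(dummies)
```

Because the missing rows are `<NA>` rather than 0, `count()` and `mean()` for these columns only use speeches with position information.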

### Table: Summary Statistics (All Variables) (Obs, Mean, SD, Min, Max)

In [9]:
all_numeric_vars = ['year', 'speech_length_words', 'english_official_language',
                    'security_council_permanent', 'gender_dummy'] + list(position_dummies.columns)

perm_members = ["RUS", "FRA", "GBR", "USA", "CHN"]
for c in perm_members:
    un_corpus_scored[f"perm_{c}"] = (un_corpus_scored["country_code"] == c).astype(int)
perm_dummies = [f"perm_{c}" for c in perm_members]

sc_index = all_numeric_vars.index("security_council_permanent") + 1
all_numeric_vars = (
    all_numeric_vars[:sc_index] +
    perm_dummies +
    all_numeric_vars[sc_index:]
)

summary_table = pd.DataFrame({
    "Variable": all_numeric_vars,
    "N": un_corpus_scored[all_numeric_vars].count().astype(int),
    "Mean": un_corpus_scored[all_numeric_vars].mean().round(3),
    "SD": un_corpus_scored[all_numeric_vars].std().round(3),
    "Min": un_corpus_scored[all_numeric_vars].min(),
    "Max": un_corpus_scored[all_numeric_vars].max()
})

position_header = pd.DataFrame({
    "Variable": ["Position"],
    "N": [""],
    "Mean": [""],
    "SD": [""],
    "Min": [""],
    "Max": [""]
})

sc_header = pd.DataFrame({
    "Variable": ["Permanent Members of the Security Council"],
    "N": [""],
    "Mean": [""],
    "SD": [""],
    "Min": [""],
    "Max": [""]
})

sc_loc = summary_table["Variable"].tolist().index("security_council_permanent") + 1

summary_table = pd.concat([
    summary_table.iloc[:sc_loc],
    sc_header,
    summary_table.iloc[sc_loc:]
]).reset_index(drop=True)

insert_idx = summary_table.index[summary_table["Variable"] == "gender_dummy"][0] + 1
summary_table = pd.concat([summary_table.iloc[:insert_idx],
                           position_header,
                           summary_table.iloc[insert_idx:]]).reset_index(drop=True)

var_labels = {
    "year": "Year",
    "speech_length_words": "Number of Words",
    "english_official_language": "English as Official Language (Yes = 1)",
    "security_council_permanent": "Permanent Membership of Security Council (Yes = 1)",
    "gender_dummy": "Gender (Female = 1)",
    "(Deputy) Minister for Foreign Affairs": "&nbsp;&nbsp;&nbsp;&nbsp;(Deputy) Minister for Foreign Affairs",
    "(Deputy) Prime Minister": "&nbsp;&nbsp;&nbsp;&nbsp;(Deputy) Prime Minister",
    "(Vice-) President": "&nbsp;&nbsp;&nbsp;&nbsp;(Vice-) President",
    "Diplomatic Representative": "&nbsp;&nbsp;&nbsp;&nbsp;Diplomatic Representative",
    "Others": "&nbsp;&nbsp;&nbsp;&nbsp;Others",

    "perm_RUS": "&nbsp;&nbsp;&nbsp;&nbsp;Russia",
    "perm_FRA": "&nbsp;&nbsp;&nbsp;&nbsp;France",
    "perm_GBR": "&nbsp;&nbsp;&nbsp;&nbsp;United Kingdom",
    "perm_USA": "&nbsp;&nbsp;&nbsp;&nbsp;United States",
    "perm_CHN": "&nbsp;&nbsp;&nbsp;&nbsp;China",

    "Permanent Members of the Security Council": "Permanent Members of the Security Council"
}

summary_table['Variable'] = summary_table['Variable'].replace(var_labels)

numeric_cols = ['Mean','SD','Min','Max']
summary_table[numeric_cols] = summary_table[numeric_cols].replace("", pd.NA)

summary_table[['Min', 'Max']] = summary_table[['Min', 'Max']].astype('Int64')

styled_table = summary_table.style \
    .hide(axis="index") \
    .set_table_styles([
        {'selector': 'th', 'props': [
            ('border-bottom', '3px solid black'), 
            ('color', 'black'),
            ('font-weight', 'bold'),
            ('text-align', 'center'),
            ('background-color', 'white')
        ]},
        {'selector': 'th.col0', 'props': [('text-align', 'left')]},
        {'selector': 'td.col0', 'props': [('text-align', 'left')]},
        {'selector': 'td.col1', 'props': [('min-width', '80px')]},
        {'selector': 'td.col2', 'props': [('min-width', '80px')]},
        {'selector': 'td.col3', 'props': [('min-width', '80px')]},
        {'selector': 'td.col4', 'props': [('min-width', '80px')]},
        {'selector': 'td.col5', 'props': [('min-width', '80px')]}
    ]) \
    .set_properties(**{'text-align': 'center'}, subset=['N','Mean','SD','Min','Max']) \
    .format({"Mean": "{:.3f}", "SD": "{:.3f}"})

note_text = "Note: Gender and position information comes from supplementary data and is not available for all speeches."

# --- EXPORT HTML ---
html_table = styled_table.to_html()
html_table_with_note = html_table + f'<p style="text-align:center; font-style:italic;">{note_text}</p>'
with open("Summary_Statistics_Table.html", "w", encoding="utf-8") as f:
    f.write(html_table_with_note)

# --- EXPORT LaTeX ---
latex_ready = summary_table.copy()
# Collapse each run of &nbsp; entities into a single 1em indent
latex_ready["Variable"] = latex_ready["Variable"].apply(
    lambda x: re.sub(r"(?:&nbsp;)+", r"\\hspace*{1em}", str(x)) if isinstance(x, str) else x
)

latex_table = latex_ready.to_latex(
    index=False,
    na_rep="",
    float_format="%.3f",
    column_format="lrrrrr",
    caption="Summary Statistics",
    label="tab:summary_stats",
    header=["Variable", "Obs", "Mean", "SD", "Min", "Max"],
    bold_rows=False,
    escape=False  
)

latex_table = latex_table.replace(
    "\\toprule",
    "\\hline\\hline"
).replace(
    "\\midrule",
    "\\hline"
).replace(
    "\\bottomrule",
    "\\hline\\hline"
)

note_text = "Note: Gender and position information comes from supplementary data and is not available for all speeches."

latex_table_with_note = latex_table.replace(
    r"\end{tabular}",
    rf"\end{{tabular}}\\[1ex] {{\centering \textit{{{note_text}}} \\}}"
)


with open("Summary_Variables.tex", "w", encoding="utf-8") as f:
    f.write(latex_table_with_note)

styled_table

Variable,N,Mean,SD,Min,Max
Year,10952.0,1993.296,20.186,1946.0,2024.0
Number of Words,10952.0,2913.75,1502.019,423.0,22003.0
English as Official Language (Yes = 1),10952.0,0.239,0.426,0.0,1.0
Permanent Membership of Security Council (Yes = 1),10952.0,0.035,0.185,0.0,1.0
Permanent Members of the Security Council,,,,,
Russia,10952.0,0.007,0.084,0.0,1.0
France,10952.0,0.007,0.082,0.0,1.0
United Kingdom,10952.0,0.007,0.085,0.0,1.0
United States,10952.0,0.007,0.085,0.0,1.0
China,10952.0,0.007,0.084,0.0,1.0
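
The N, mean, SD, min and max columns above can also be produced in a single aggregation. The following is a minimal sketch on invented toy data, not the code used for the table; `describe_columns` is a hypothetical helper name.

```python
import pandas as pd

def describe_columns(df, columns):
    """N, mean, SD, min and max for each column, one row per variable."""
    stats = df[columns].agg(["count", "mean", "std", "min", "max"]).T
    stats.columns = ["N", "Mean", "SD", "Min", "Max"]
    stats["N"] = stats["N"].astype(int)
    return stats.round(3).rename_axis("Variable").reset_index()

# Toy data: one speech has no gender information
toy = pd.DataFrame({"year": [1946, 1950, 2024], "gender_dummy": [0, None, 1]})
print(describe_columns(toy, ["year", "gender_dummy"]))
```

Missing values are skipped by every statistic, so N differs across variables in the same way as in the summary table.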


### Table: Counts and Shares of Categorical Variables

In [11]:
# --- Ensure numeric dummies ---
#un_corpus_scored['gender_dummy'] = pd.to_numeric(un_corpus_scored['gender_dummy'], errors='coerce')
#un_corpus_scored['english_official_language'] = pd.to_numeric(un_corpus_scored['english_official_language'], errors='coerce')
#un_corpus_scored['security_council_permanent'] = pd.to_numeric(un_corpus_scored['security_council_permanent'], errors='coerce')

# Create Dummies for permanent members of the security council
perm_members = ["RUS", "FRA", "GBR", "USA", "CHN"]
perm_labels = ["Russia", "France", "United Kingdom", "United States", "China"]
for c in perm_members:
    un_corpus_scored[f"perm_{c}"] = (un_corpus_scored["country_code"] == c).astype(int)

rows = []

gender_available = un_corpus_scored['gender_dummy'].notna().sum()
gender_counts = un_corpus_scored['gender_dummy'].value_counts(dropna=True)
rows.append(['Gender', '', gender_available, ''])
rows.append(['', 'Male', gender_counts.get(0, 0), f"{gender_counts.get(0,0)/gender_available:.1%}"])
rows.append(['', 'Female', gender_counts.get(1, 0), f"{gender_counts.get(1,0)/gender_available:.1%}"])

positions = list(position_dummies.columns)
position_available = un_corpus_scored[positions].notna().all(axis=1).sum()
position_counts = un_corpus_scored[positions].sum()
rows.append(['Position', '', position_available, ''])
for pos in positions:
    rows.append(['', pos, position_counts[pos], f"{position_counts[pos]/position_available:.1%}"])

sc_available = un_corpus_scored['security_council_permanent'].notna().sum()
sc_counts = un_corpus_scored['security_council_permanent'].value_counts(dropna=True)
rows.append(['Permanent Membership of the Security Council', '', sc_available, ''])
rows.append(['', 'No', sc_counts.get(0,0), f"{sc_counts.get(0,0)/sc_available:.1%}"])
rows.append(['', 'Yes', sc_counts.get(1,0), f"{sc_counts.get(1,0)/sc_available:.1%}"])

sc_country_counts = [un_corpus_scored.get(f'perm_{c}', pd.Series(0)).sum() for c in perm_members]
total_p5_count = sum(sc_country_counts)
rows.append(['Permanent Members of the Security Council', '', total_p5_count, ''])
for label, cnt in zip(perm_labels, sc_country_counts):
    rows.append(['', label, cnt, f"{cnt/sc_available:.1%}"])

eng_available = un_corpus_scored['english_official_language'].notna().sum()
eng_counts = un_corpus_scored['english_official_language'].value_counts(dropna=True)
rows.append(['English as Official Language', '', eng_available, ''])
rows.append(['', 'No', eng_counts.get(0,0), f"{eng_counts.get(0,0)/eng_available:.1%}"])
rows.append(['', 'Yes', eng_counts.get(1,0), f"{eng_counts.get(1,0)/eng_available:.1%}"])

summary_hierarchical = pd.DataFrame(rows, columns=['Category', 'Subcategory', 'N', 'Share'])

# --- HTML ---
def top_border_html(row):
    return ['border-top:1px solid black;' if row['Category'] != '' else '' for _ in row]

styled_table = summary_hierarchical.style \
    .hide(axis="index") \
    .set_table_styles([
        {'selector':'th','props':[('border-bottom','3px solid black'),
                                  ('text-align','center'),
                                  ('font-weight','bold')]},
        {'selector':'td.col0','props':[('text-align','left')]},
        {'selector':'td.col1','props':[('text-align','left')]},
        {'selector':'td','props':[('text-align','center')]}
    ]) \
    .apply(top_border_html, axis=1)


html_file = "Frequencies_Cat_Var_.html"
styled_table.to_html(html_file)

# --- LaTeX ---
latex_ready = summary_hierarchical.copy()

latex_ready['Subcategory'] = latex_ready.apply(
    lambda x: '\\hspace*{1em}' + str(x['Subcategory']) if x['Category']=='' else x['Subcategory'], axis=1
)
latex_ready['Share'] = latex_ready['Share'].str.replace('%', r'\%', regex=False)

latex_table = latex_ready.to_latex(
    index=False,
    na_rep='',
    column_format='p{6cm}lrr',
    caption="Frequencies of Categorical Variables",
    label="tab:frequencies_cat_var_summary",
    escape=False
)

latex_table = latex_table.replace("\\toprule","\\hline\\hline") \
                         .replace("\\midrule","") \
                         .replace("\\bottomrule","\\hline\\hline")

lines = latex_table.splitlines()
category_rows = summary_hierarchical[summary_hierarchical['Category'] != ''].index.tolist()
for idx in category_rows[::-1]:  # reverse so insertion doesn't shift lines
    for i, l in enumerate(lines):
        if re.match(re.escape(summary_hierarchical.loc[idx, 'Category']), l):
            lines.insert(i, '\\hline')  # insert above
            break
latex_table = "\n".join(lines)

note_text = "Note: Gender, position, and Security Council information comes from supplementary data and is not available for all speeches."
latex_table_with_note = latex_table.replace(
    "\\end{tabular}",
    f"\\end{{tabular}}\n\\\\[1ex] {{\\centering \\textit{{{note_text}}} \\\\}}"
)

latex_file = "Frequencies_Cat_Var_Table.tex"
with open(latex_file, "w", encoding="utf-8") as f:
    f.write(latex_table_with_note)

styled_table

Category,Subcategory,N,Share
Gender,,4704,
,Male,4521,96.1%
,Female,183,3.9%
Position,,6273,
,(Deputy) Minister for Foreign Affairs,2387,38.1%
,(Deputy) Prime Minister,1239,19.8%
,(Vice-) President,2060,32.8%
,Diplomatic Representative,339,5.4%
,Others,248,4.0%
Permanent Membership of the Security Council,,10952,
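
Each share in this table is the category count divided by the number of non-missing cases for that variable, for example 183 / 4704 = 3.9% for female speakers. A minimal, library-free sketch of that calculation follows; `counts_and_shares` is a hypothetical helper, and the guard against an empty sample is an addition that the cell above does not have.

```python
from collections import Counter

def counts_and_shares(values, labels):
    """Count each coded value and express it as a share of the non-missing cases."""
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    total = len(observed)
    rows = []
    for code, label in labels.items():
        n = counts.get(code, 0)
        share = f"{n / total:.1%}" if total else ""  # avoid division by zero
        rows.append((label, n, share))
    return total, rows

# Toy data: three male speakers, one female speaker, one unknown
total, rows = counts_and_shares([0, 0, 1, None, 0], {0: "Male", 1: "Female"})
print(total, rows)
```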


-----

## Tables Emotionality Scoring

### Table: Emotionality Scoring - per Decade

In [17]:
decade_start = (np.floor((un_corpus_scored['year'] - 1946) / 10) * 10 + 1946).astype(int)
decade_end = decade_start + 9
decade_end = decade_end.where(decade_end < 2024, 2024)

un_corpus_scored['Decade'] = decade_start.astype(str) + "–" + decade_end.astype(str)

decade_summary = (
    un_corpus_scored.groupby('Decade')['score']
    .agg(Obs='count', Mean='mean', SD='std', Min='min', Max='max')
    .reset_index()
)

numeric_cols = ['Mean', 'SD', 'Min', 'Max']
decade_summary[numeric_cols] = decade_summary[numeric_cols].round(3)

styled_decade_table = (
    decade_summary.style
    .hide(axis="index")
    .set_table_styles([
        {'selector': 'th', 'props': [
            ('border-bottom', '3px solid black'),
            ('color', 'black'),
            ('font-weight', 'bold'),
            ('text-align', 'center'),
            ('background-color', 'white')
        ]},
        {'selector': 'th.col0', 'props': [('text-align', 'left')]},
        {'selector': 'td.col0', 'props': [('text-align', 'left')]},
        {'selector': 'td.col1', 'props': [('min-width', '80px')]},
        {'selector': 'td.col2', 'props': [('min-width', '80px')]},
        {'selector': 'td.col3', 'props': [('min-width', '80px')]},
        {'selector': 'td.col4', 'props': [('min-width', '80px')]},
        {'selector': 'td.col5', 'props': [('min-width', '80px')]}
    ])
    .set_properties(**{'text-align': 'center'}, subset=['Obs', 'Mean', 'SD', 'Min', 'Max'])
    .format({
        'Obs': '{:.0f}',
        'Mean': '{:.3f}',
        'SD': '{:.3f}',
        'Min': '{:.3f}',
        'Max': '{:.3f}'
    }, na_rep='-')
)

# --- EXPORT HTML ---
styled_decade_table.to_html("Scoring_per_Decade.html")

# --- EXPORT LaTeX ---

latex_table = decade_summary.to_latex(
    index=False,
    na_rep="",
    float_format="%.3f",
    column_format="lrrrrr",
    escape=False
)

latex_table = (
    latex_table.replace("\\toprule", "\\hline\\hline")
               .replace("\\midrule", "\\hline")
               .replace("\\bottomrule", "\\hline\\hline")
)

latex_str = (
    "\\begin{table}[htbp]\n"
    "\\centering\n"
    "\\caption{Emotionality Scoring by Decade}\n"
    "\\label{tab:summary_decade}\n"
    + latex_table +
    "\n\\end{table}"
)

with open("Scoring_per_Decade.tex", "w", encoding="utf-8") as f:
    f.write(latex_str)

styled_decade_table

Decade,Obs,Mean,SD,Min,Max
1946–1955,421,0.849,0.178,0.499,1.415
1956–1965,833,0.853,0.188,0.448,1.365
1966–1975,1132,0.866,0.204,0.427,1.555
1976–1985,1436,0.841,0.187,0.436,1.542
1986–1995,1618,0.768,0.187,0.362,1.494
1996–2005,1840,0.761,0.21,0.315,1.66
2006–2015,1927,0.791,0.214,0.317,1.511
2016–2024,1745,0.83,0.198,0.338,1.502
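
The decade bins start in 1946 rather than at calendar decades, and the last bin is cut off at 2024. The rule can be written as a small standalone function, which makes the edge cases easy to check; `decade_label` is a hypothetical name used only for this sketch.

```python
def decade_label(year, first_year=1946, last_year=2024):
    """Map a year to its ten-year bin, counted from first_year."""
    start = first_year + ((year - first_year) // 10) * 10
    end = min(start + 9, last_year)  # the final bin is shorter than ten years
    return f"{start}–{end}"

for year in (1946, 1955, 1956, 2024):
    print(year, decade_label(year))
```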


### Table: Emotionality Scoring - Subsamples

In [20]:
entire = un_corpus_scored["score"]

gender_sample = un_corpus_scored.loc[
    un_corpus_scored["gender_dummy"].notna(),
    "score"
]

position_sample = un_corpus_scored.loc[
    un_corpus_scored["position"].notna(),
    "score"
]

p5_sample = un_corpus_scored.loc[
    un_corpus_scored["security_council_permanent"] == 1,
    "score"
]

def summarize(series, name):
    return pd.DataFrame({
        "Sample": [name],
        "N": [series.count()],
        "Mean": [round(series.mean(),3)],
        "SD": [round(series.std(),3)],
        "Min": [round(series.min(),3)],
        "Max": [round(series.max(),3)]
    })

summary = pd.concat([
    summarize(entire, "Entire Sample"),
    summarize(gender_sample, "Gender Sample"),
    summarize(position_sample, "Position Sample"),
    summarize(p5_sample, "Permanent Members of the Security Council Sample")
], ignore_index=True)

summary = summary[["Sample", "N", "Mean", "SD", "Min", "Max"]]


# --- EXPORT HTML ---

html_file = "Scoring_Subsamples.html"

styled = summary.style \
    .hide(axis="index") \
    .set_table_styles([
        {'selector': 'th', 'props': [
            ('border-bottom', '3px solid black'),
            ('font-weight', 'bold'),
            ('text-align', 'center')
        ]},
        {'selector': 'td.col0', 'props': [('text-align', 'left')]},  # left-align Sample column
        {'selector': 'td', 'props': [('text-align', 'center')]}      # center numeric columns
    ]) \
    .format({
        "Mean": "{:.3f}",
        "SD": "{:.3f}",
        "Min": "{:.3f}",
        "Max": "{:.3f}"
    })

styled.to_html(html_file)

# --- EXPORT LaTeX ---

latex_table = summary.to_latex(
    index=False,
    column_format="lrrrrr",
    #caption="Scoring Subsamples",
    #label="tab:scoring_subsamples",
    float_format="%.3f",
    escape=False
)

latex_table = (
    latex_table.replace("\\toprule","\\hline\\hline")
               .replace("\\midrule","\\hline")
               .replace("\\bottomrule","\\hline\\hline")
)

latex_str = (
    "\\begin{table}[htbp]\n"
    "\\centering\n"
    "\\caption{Emotionality Scoring - Subsamples}\n"
    "\\label{tab:scoring_subsamples}\n"
    + latex_table +
    "\n\\end{table}"
)

latex_file = "Scoring_Subsamples.tex"
with open(latex_file, "w", encoding="utf-8") as f:
    f.write(latex_str)


summary

Unnamed: 0,Sample,N,Mean,SD,Min,Max
0,Entire Sample,10952,0.81,0.203,0.315,1.66
1,Gender Sample,4704,0.822,0.196,0.338,1.555
2,Position Sample,6273,0.794,0.209,0.315,1.66
3,Permanent Members of the Security Council Sample,388,0.842,0.211,0.397,1.491
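
Most LaTeX exports in this script repeat the same two steps: replace the booktabs rules with `\hline`, then wrap the tabular in a `table` float with caption and label. A sketch of a reusable helper for these steps is shown below. `wrap_latex_table` is a hypothetical name, and the `demo` string only imitates the shape of the `to_latex` output.

```python
def wrap_latex_table(tabular, caption, label, placement="htbp"):
    """Swap booktabs rules for \\hline and wrap a tabular in a table float."""
    rules = {
        "\\toprule": "\\hline\\hline",
        "\\midrule": "\\hline",
        "\\bottomrule": "\\hline\\hline",
    }
    for old, new in rules.items():
        tabular = tabular.replace(old, new)
    return (
        f"\\begin{{table}}[{placement}]\n"
        "\\centering\n"
        f"\\caption{{{caption}}}\n"
        f"\\label{{{label}}}\n"
        f"{tabular.strip()}\n"
        "\\end{table}"
    )

# Hand-written stand-in for the string returned by DataFrame.to_latex
demo = (
    "\\begin{tabular}{lr}\n\\toprule\nSample & N \\\\\n\\midrule\n"
    "Entire Sample & 10952 \\\\\n\\bottomrule\n\\end{tabular}\n"
)
print(wrap_latex_table(demo, "Emotionality Scoring - Subsamples", "tab:scoring_subsamples"))
```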


### Table: Emotionality Scoring - Categorical Variables

In [63]:
perm_members = ["RUS", "FRA", "GBR", "USA", "CHN"]
perm_labels = ["Russia", "France", "United Kingdom", "United States", "China"]

for c in perm_members:
    un_corpus_scored[f"perm_{c}"] = (un_corpus_scored["country_code"] == c).astype(int)

summary_list = []

overall_row = pd.DataFrame({
    'Variable': ['Overall'],
    'N': [un_corpus_scored['score'].count()],
    'Mean': [un_corpus_scored['score'].mean()],
    'SD': [un_corpus_scored['score'].std()],
    'Min': [un_corpus_scored['score'].min()],
    'Max': [un_corpus_scored['score'].max()]
})

summary_list.append(overall_row)

summary_list.append(pd.DataFrame({
    'Variable': ['English as Official Language'],
    'N': [''], 'Mean': [''], 'SD': [''], 'Min': [''], 'Max': ['']
}))
for val in sorted(un_corpus_scored['english_official_language'].dropna().unique()):
    subset = un_corpus_scored[un_corpus_scored['english_official_language'] == val]
    summary_list.append(pd.DataFrame({
        'Variable': [f"&nbsp;&nbsp;&nbsp;&nbsp;{'Yes (=1)' if val==1 else 'No (=0)'}"],
        'N': [subset['score'].count()],
        'Mean': [subset['score'].mean()],
        'SD': [subset['score'].std()],
        'Min': [subset['score'].min()],
        'Max': [subset['score'].max()]
    }))

summary_list.append(pd.DataFrame({
    'Variable': ['Permanent Membership of the Security Council'],
    'N': [''], 'Mean': [''], 'SD': [''], 'Min': [''], 'Max': ['']
}))
for val in sorted(un_corpus_scored['security_council_permanent'].dropna().unique()):
    subset = un_corpus_scored[un_corpus_scored['security_council_permanent'] == val]
    summary_list.append(pd.DataFrame({
        'Variable': [f"&nbsp;&nbsp;&nbsp;&nbsp;{'Yes (=1)' if val==1 else 'No (=0)'}"],
        'N': [subset['score'].count()],
        'Mean': [subset['score'].mean()],
        'SD': [subset['score'].std()],
        'Min': [subset['score'].min()],
        'Max': [subset['score'].max()]
    }))

summary_list.append(pd.DataFrame({
    'Variable': ['Permanent Members of the Security Council'],
    'N': [''], 'Mean': [''], 'SD': [''], 'Min': [''], 'Max': ['']
}))
for label, c in zip(perm_labels, perm_members):
    subset = un_corpus_scored[un_corpus_scored[f'perm_{c}']==1]
    summary_list.append(pd.DataFrame({
        'Variable': [f"&nbsp;&nbsp;&nbsp;&nbsp;{label}"],
        'N': [subset['score'].count()],
        'Mean': [subset['score'].mean()],
        'SD': [subset['score'].std()],
        'Min': [subset['score'].min()],
        'Max': [subset['score'].max()]
    }))

summary_list.append(pd.DataFrame({
    'Variable': ['Gender'],
    'N': [''], 'Mean': [''], 'SD': [''], 'Min': [''], 'Max': ['']
}))
for val in sorted(un_corpus_scored['gender_dummy'].dropna().unique()):
    subset = un_corpus_scored[un_corpus_scored['gender_dummy']==val]
    summary_list.append(pd.DataFrame({
        'Variable': [f"&nbsp;&nbsp;&nbsp;&nbsp;{'Female (=1)' if val==1 else 'Male (=0)'}"],
        'N': [subset['score'].count()],
        'Mean': [subset['score'].mean()],
        'SD': [subset['score'].std()],
        'Min': [subset['score'].min()],
        'Max': [subset['score'].max()]
    }))

summary_list.append(pd.DataFrame({
    'Variable': ['Position'],
    'N': [''], 'Mean': [''], 'SD': [''], 'Min': [''], 'Max': ['']
}))

for pos in position_dummies.columns:
    subset = un_corpus_scored[un_corpus_scored[pos]==1]
    summary_list.append(pd.DataFrame({
        'Variable': [f"&nbsp;&nbsp;&nbsp;&nbsp;{pos}"],
        'N': [subset['score'].count()],
        'Mean': [subset['score'].mean()],
        'SD': [subset['score'].std()],
        'Min': [subset['score'].min()],
        'Max': [subset['score'].max()]
    }))

score_summary_table = pd.concat(summary_list, ignore_index=True)

numeric_cols = ['Mean', 'SD', 'Min', 'Max']
score_summary_table[numeric_cols] = score_summary_table[numeric_cols].round(3)
score_summary_table[numeric_cols] = score_summary_table[numeric_cols].replace("", pd.NA)

styled_score_table = (
    score_summary_table.style
    .hide(axis="index")
    .set_table_styles([
        {'selector': 'th', 'props': [
            ('border-bottom', '3px solid black'),
            ('color', 'black'),
            ('font-weight', 'bold'),
            ('text-align', 'center'),
            ('background-color', 'white')
        ]},
        {'selector': 'th.col0', 'props': [('text-align', 'left')]},
        {'selector': 'td.col0', 'props': [('text-align', 'left')]},
        {'selector': 'td', 'props': [('text-align', 'center')]}
    ])
    .set_properties(**{'text-align': 'center'}, subset=['N','Mean','SD','Min','Max'])
    .format({col: "{:.3f}" for col in numeric_cols})
)

# --- EXPORT HTML ---
styled_score_table.to_html("Scoring_categorial_variable.html")

# --- EXPORT LaTeX ---
latex_table = score_summary_table.copy()
latex_table['Variable'] = latex_table['Variable'].apply(
    lambda x: str(x).replace("&nbsp;&nbsp;&nbsp;&nbsp;", "\\hspace*{1em}") if isinstance(x, str) else x
)

latex_str = latex_table.to_latex(
    index=False,
    na_rep="",
    float_format="%.3f",
    column_format="p{7cm}rrrrr",
    #caption="Emotionality Scoring for the categorial variables",
   # label="tab:conditional_vars",
    escape=False
)

latex_str = latex_str.replace("\\toprule", "\\hline\\hline") \
                     .replace("\\midrule", "\\hline") \
                     .replace("\\bottomrule", "\\hline\\hline")

latex_str_centered = (
    "\\begin{table}[htbp]\n"
    "\\centering\n"   
    "\\caption{Emotionality Scoring for the categorical variables}\n"
    "\\label{tab:conditional_vars}\n"
    + latex_str +
    "\n\\end{table}"
)


with open("Scoring_categorial_variable.tex", "w", encoding="utf-8") as f:
    f.write(latex_str_centered)

styled_score_table

Variable,N,Mean,SD,Min,Max
Overall,10952.0,0.81,0.203,0.315,1.66
English as Official Language,,,,,
No (=0),8339.0,0.807,0.206,0.315,1.66
Yes (=1),2613.0,0.82,0.19,0.388,1.627
Permanent Membership of the Security Council,,,,,
No (=0),10564.0,0.809,0.202,0.315,1.66
Yes (=1),388.0,0.842,0.211,0.397,1.491
Permanent Members of the Security Council,,,,,
Russia,78.0,0.738,0.144,0.43,1.027
France,74.0,0.823,0.175,0.5,1.352
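
The cell above repeats the same block for every categorical variable: a header row followed by one statistics row per value. As a sketch under the assumption that the score column is called `score`, the pattern can be captured in two small helpers. `score_stats` and `grouped_block` are hypothetical names. Leaving the header statistics missing instead of filling them with empty strings keeps the columns numeric, so a later `round(3)` takes effect.

```python
import pandas as pd

COLUMNS = ["Variable", "N", "Mean", "SD", "Min", "Max"]

def score_stats(scores, label):
    """One table row with N, mean, SD, min and max of a score series."""
    return {
        "Variable": label,
        "N": scores.count(),
        "Mean": scores.mean(),
        "SD": scores.std(),
        "Min": scores.min(),
        "Max": scores.max(),
    }

def grouped_block(df, column, header, value_labels, score_col="score"):
    """Header row followed by one statistics row per coded value."""
    rows = [{"Variable": header}]  # statistics stay missing in the header row
    for value, label in value_labels.items():
        subset = df.loc[df[column] == value, score_col]
        rows.append(score_stats(subset, f"&nbsp;&nbsp;&nbsp;&nbsp;{label}"))
    return pd.DataFrame(rows, columns=COLUMNS)

# Toy data, invented for illustration
toy = pd.DataFrame({"score": [0.8, 0.9, 1.0, 0.7], "gender_dummy": [0, 1, 0, 1]})
block = grouped_block(toy, "gender_dummy", "Gender", {0: "Male (=0)", 1: "Female (=1)"})
print(block.round(3))
```

Several such blocks can be combined with `pd.concat(..., ignore_index=True)`, as the original cell does with `summary_list`.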


### Table: Emotionality Scoring - Position (From 1994)

In [26]:
un_corpus_scored_since_1994 = un_corpus_scored[un_corpus_scored['year'] >= 1994]

position_vars = list(position_dummies.columns)

summary_list = []

overall_subset = un_corpus_scored_since_1994['score']
overall_row = pd.DataFrame({
    'Variable': ['Overall Sample since 1994'],
    'N': [overall_subset.count()],
    'Mean': [overall_subset.mean()],
    'SD': [overall_subset.std()],
    'Min': [overall_subset.min()],
    'Max': [overall_subset.max()]
})
summary_list.append(overall_row)

position_header = pd.DataFrame({
    'Variable': ['Position since 1994'],
    'N': [""],
    'Mean': [""],
    'SD': [""],
    'Min': [""],
    'Max': [""]
})
summary_list.append(position_header)

for var in position_vars:
    subset = un_corpus_scored_since_1994[un_corpus_scored_since_1994[var] == 1]
    summary_list.append(pd.DataFrame({
        'Variable': [f"&nbsp;&nbsp;&nbsp;&nbsp;{var_labels.get(var, var)}"],
        'N': [subset['score'].count()],
        'Mean': [subset['score'].mean()],
        'SD': [subset['score'].std()],
        'Min': [subset['score'].min()],
        'Max': [subset['score'].max()]
    }))

score_summary_table = pd.concat(summary_list, ignore_index=True)

numeric_cols = ['Mean', 'SD', 'Min', 'Max']
score_summary_table[numeric_cols] = score_summary_table[numeric_cols].round(3)
score_summary_table[numeric_cols] = score_summary_table[numeric_cols].replace("", pd.NA)

styled_score_table = (
    score_summary_table.style
    .hide(axis="index")
    .set_table_styles([
         {'selector': 'table', 'props': [('width', '100%')]}, 
        {'selector': 'th', 'props': [
            ('border-bottom', '3px solid black'),
            ('color', 'black'),
            ('font-weight', 'bold'),
            ('text-align', 'center'),
            ('background-color', 'white')
        ]},
        {'selector': 'th.col0', 'props': [('text-align', 'left')]},
        {'selector': 'td.col0', 'props': [('text-align', 'left')]},
        {'selector': 'td', 'props': [('text-align', 'center')]}
    ])
    .set_properties(**{'text-align': 'center'}, subset=['N','Mean','SD','Min','Max'])
    .format({col: "{:.3f}" for col in numeric_cols})
)


# --- EXPORT HTML ---
styled_score_table.to_html("Scoring_positions_from1994.html")

# --- EXPORT LaTeX ---
latex_table = score_summary_table.copy()
latex_table['Variable'] = latex_table['Variable'].apply(
    lambda x: str(x).replace("&nbsp;&nbsp;&nbsp;&nbsp;", "\\hspace*{1em}") if isinstance(x, str) else x
)

# Prepare tabular only
tabular_only = latex_table.to_latex(
    index=False,
    na_rep="",
    float_format="%.3f",
    column_format="lrrrrr",
    escape=False
)

# Replace default rules with hlines
tabular_only = tabular_only.replace("\\toprule", "\\hline\\hline") \
                           .replace("\\midrule", "\\hline") \
                           .replace("\\bottomrule", "\\hline\\hline")

# Wrap in standard table environment without resizing
latex_str_centered = (
    "\\begin{table}[ht]\n"
    "\\centering\n"
    "\\caption{Emotionality Scoring for positions from 1994 till 2024}\n"
    "\\label{tab:positions_1994}\n"
    + tabular_only +
    "\n\\end{table}"
)

with open("Scoring_positions_from1994.tex", "w", encoding="utf-8") as f:
    f.write(latex_str_centered)

styled_score_table

Variable,N,Mean,SD,Min,Max
Overall Sample since 1994,5862.0,0.79,0.209,0.315,1.66
Position since 1994,,,,,
(Deputy) Minister for Foreign Affairs,2371.0,0.753,0.194,0.315,1.66
(Deputy) Prime Minister,1108.0,0.795,0.214,0.317,1.498
(Vice-) President,1834.0,0.838,0.211,0.351,1.511
Diplomatic Representative,339.0,0.742,0.209,0.347,1.627
Others,191.0,0.844,0.227,0.472,1.502
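
For the LaTeX export, the HTML indentation (`&nbsp;` entities) in the `Variable` column is turned into `\hspace*{1em}`. A small sketch of that conversion with a regular expression follows; `html_indent_to_latex` is a hypothetical helper. The group `(?:&nbsp;)+` matches a run of whole entities, whereas `&nbsp;+` would only match `&nbsp` followed by repeated semicolons.

```python
import re

def html_indent_to_latex(label):
    """Replace a leading run of &nbsp; entities with a single 1em LaTeX indent."""
    # A function as replacement avoids backslash processing in the template
    return re.sub(r"^(?:&nbsp;)+", lambda m: "\\hspace*{1em}", label)

print(html_indent_to_latex("&nbsp;&nbsp;&nbsp;&nbsp;Others"))
print(html_indent_to_latex("Position since 1994"))
```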


## Other Tables 

### Table: T-Test Subsamples

In [77]:
test_vars = ['gender_dummy', 'position']
table_labels = {'gender_dummy': 'Gender', 'position': 'Position'}

summary_list = []

for var in test_vars:
    scores = un_corpus_scored['score']

    group_non_missing = scores[un_corpus_scored[var].notna()]
    group_missing = scores[un_corpus_scored[var].isna()]

    mean_non_missing = round(group_non_missing.mean(), 3)
    mean_missing = round(group_missing.mean(), 3)

    t_stat, p_val = stats.ttest_ind(group_non_missing, group_missing, nan_policy='omit')
    t_stat = round(t_stat, 3)
    p_val = round(p_val, 3)  

    summary_list.append({
        'Variable': table_labels[var],
        'N (Non-Missing)': len(group_non_missing),
        'N (Missing)': len(group_missing),
        'Mean (Non-Missing)': mean_non_missing,
        'Mean (Missing)': mean_missing,
        't-test': t_stat,
        'p-value': p_val
    })

summary_df = pd.DataFrame(summary_list)

styled_table = (
    summary_df.style
    .hide(axis="index")
    .set_table_styles([
        {'selector': 'th', 'props': [
            ('border-bottom', '3px solid black'), 
            ('color', 'black'),
            ('font-weight', 'bold'),
            ('text-align', 'center'),
            ('background-color', 'white')
        ]},
        {'selector': 'th.col0', 'props': [('text-align', 'left')]},
        {'selector': 'td.col0', 'props': [('text-align', 'left')]},
        {'selector': 'td', 'props': [('text-align', 'center')]},
    ])
    .set_properties(**{'text-align': 'center'})
    .format({
        "N (Non-Missing)": "{:.0f}",
        "N (Missing)": "{:.0f}",
        "Mean (Non-Missing)": "{:.3f}",
        "Mean (Missing)": "{:.3f}",
        "t-test": "{:.3f}",
        "p-value": "{:.3f}"
    }, na_rep="-")
)

# --- Export HTML ---
styled_table.to_html("TTest_Scoring_Gender_Position.html")

# --- Export LaTeX ---

latex_ready = summary_df.copy()

latex_ready.columns = [
    'Variable',
    '\\makecell{Obs \\\\ (Non-Missing)}',
    '\\makecell{Obs \\\\ (Missing)}',
    '\\makecell{Mean \\\\ (Non-Missing)}',
    '\\makecell{Mean \\\\ (Missing)}',
    't-test',
    'p-value'
]

latex_table = latex_ready.to_latex(
    index=False,
    na_rep="",
    float_format="%.3f",
    column_format="lrrrrrr",
   # caption="T-Test of Subsample for the Emotionality Scoring",
   # label="tab:summary_stats_ttest",
    header=["Variable", "Obs (Non-Missing)", "Obs (Missing)", 
            "Mean (Non-Missing)", "Mean (Missing)", "t-test", "p-value"],
    bold_rows=False,
    escape=False
)

latex_table = (
    latex_table.replace("\\toprule", "\\hline\\hline")
               .replace("\\midrule", "\\hline")
               .replace("\\bottomrule", "\\hline\\hline")
)

latex_table_centered = (
    "\\begin{table}[htbp]\n"
    "\\centering\n"
    "\\caption{T-Test of Subsample for the Emotionality Scoring}\n"
    "\\label{tab:summary_stats_ttest}\n"
    + latex_table +
    "\n\\end{table}"
)

with open("TTest_Scoring_Gender_Position.tex", "w", encoding="utf-8") as f:
    f.write(latex_table_centered)

styled_table

Variable,N (Non-Missing),N (Missing),Mean (Non-Missing),Mean (Missing),t-test,p-value
Gender,4704,6248,0.822,0.801,5.253,0.0
Position,6273,4679,0.794,0.831,-9.534,0.0
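
The `stats.ttest_ind` call above relies on SciPy's default `equal_var=True`, i.e. Student's t-test with pooled variance. Since the missing and non-missing groups differ in size, a Welch test (`equal_var=False`) may be a useful robustness check. The following is only a minimal sketch on synthetic data: the frame `demo`, its columns and the simulated values are assumptions for illustration and are not taken from the corpus.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the scored corpus (illustrative values only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "score": rng.normal(0.8, 0.2, size=200),
    "gender": rng.choice([0.0, 1.0, np.nan], size=200),
})

# Split the scores by whether the covariate is observed.
is_missing = demo["gender"].isna()
non_missing = demo.loc[~is_missing, "score"]
missing = demo.loc[is_missing, "score"]

# Student's t-test (pooled variance), the SciPy default.
student = stats.ttest_ind(non_missing, missing, nan_policy="omit")
# Welch's t-test, which does not assume equal variances.
welch = stats.ttest_ind(non_missing, missing, equal_var=False, nan_policy="omit")

print(f"Student: t={student.statistic:.3f}, p={student.pvalue:.3f}")
print(f"Welch:   t={welch.statistic:.3f}, p={welch.pvalue:.3f}")
```

If both variants lead to the same conclusion, the choice of test is unlikely to drive the result reported in the table.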


### Table: Years with more than 5 female speakers

In [33]:
female_threshold = 5

counts = un_corpus_scored.groupby('year')['gender_dummy'].value_counts().unstack(fill_value=0)

filtered_counts = counts[counts.get(1, 0) > female_threshold].copy()

filtered_years = pd.DataFrame({
    'Year': filtered_counts.index,
    'Male': filtered_counts.get(0, 0).values,
    'Female': filtered_counts.get(1, 0).values
})

# --- HTML ---
html_table = filtered_years.to_html(index=False)
with open("Female_Speeches_Years.html", "w", encoding="utf-8") as f:
    f.write(html_table)

# --- LaTeX ---
tabular_only = filtered_years.to_latex(
    index=False,
    column_format="lrr",
    na_rep="0",
    escape=False
)

tabular_only = tabular_only.replace("\\toprule", "\\hline\\hline") \
                           .replace("\\midrule", "\\hline") \
                           .replace("\\bottomrule", "\\hline\\hline")

latex_str_resized = (
    "\\begin{table}[htbp]\n"
    "\\centering\n"
    "\\caption{Years with More Than 5 Female Speakers}\n"
    "\\label{tab:female_years}\n"
    + tabular_only +
    "\n\\end{table}"
)

with open("Female_Speeches_Years.tex", "w", encoding="utf-8") as f:
    f.write(latex_str_resized)

filtered_years

Unnamed: 0,Year,Male,Female
0,1993,152,8
1,1994,153,8
2,1995,157,9
3,2006,124,15
4,2015,134,17
5,2016,165,18
6,2017,163,19
7,2018,165,18
8,2019,166,15
9,2020,165,8
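
The filter above uses `counts.get(1, 0)`, which falls back to the scalar `0` if no female speaker is present in the data; boolean indexing with a scalar would then fail. A possible safeguard, sketched here on toy data (the values and the lower threshold are assumptions chosen for illustration), is to reindex the columns so that both categories always exist:

```python
import pandas as pd

# Toy data: 1 = female speaker, 0 = male speaker (illustrative only).
demo = pd.DataFrame({
    "year": [1993] * 8 + [1994] * 4,
    "gender_dummy": [1, 1, 1, 0, 0, 0, 0, 0] + [1, 0, 0, 0],
})
female_threshold = 2

# Count speakers per year and gender; reindex guarantees both columns exist.
counts = (
    demo.groupby("year")["gender_dummy"]
    .value_counts()
    .unstack(fill_value=0)
    .reindex(columns=[0, 1], fill_value=0)
)

# Keep only years whose female count exceeds the threshold.
filtered = counts[counts[1] > female_threshold]
result = (
    filtered.rename(columns={0: "Male", 1: "Female"})
    .rename_axis("Year")
    .reset_index()
)
print(result)
```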


### Table: Number of (Unique) Tokens

In [36]:
# --- Tokenize 'speech' column --- (Temporarily)
un_corpus_scored["speech_tokenized"] = un_corpus_scored["speech"].apply(
    lambda x: x.split() if isinstance(x, str) else []
)

print(un_corpus_scored["speech_tokenized"].head())

0    [At, the, resumption, of, the, first, session,...
1    [The, General, Assembly, of, the, United, Nati...
2    [The, principal, organs, of, the, United, Nati...
3    [As, more, than, a, year, has, elapsed, since,...
4    [Coming, to, this, platform, where, so, many, ...
Name: speech_tokenized, dtype: object


In [37]:
def to_list(val):
    if isinstance(val, list):
        return val
    elif isinstance(val, str):
        try:
            return ast.literal_eval(val)
        except (ValueError, SyntaxError):  # string is not a valid Python literal
            return []
    else:
        return []

columns = ["speech_tokenized", "speech_preprocessed", "speech_final"]
summary = []

for col in columns:
    all_tokens = []
    for val in un_corpus_scored[col].dropna():
        tokens = to_list(val)  # convert string to list if needed
        all_tokens.extend(tokens)
    total_tokens = len(all_tokens)
    unique_tokens = len(set(all_tokens))
    summary.append([col, total_tokens, unique_tokens])

table_names = [
    "Raw Speech",
    "Preprocessed Speech",
    "Final Speech (Frequency > 10)"
]

summary_df = pd.DataFrame(summary, columns=["Column", "Total Tokens", "Total Unique Tokens"])
summary_df.insert(0, "Speech Type", table_names)
summary_df = summary_df.drop(columns=["Column"])

# --- LaTeX ---
summary_df.iloc[0, 0] = "Raw Speech"
summary_df.iloc[1, 0] = "Preprocessed Speech"
summary_df.iloc[2, 0] = "Final Speech"

tabular_only = summary_df.to_latex(
    index=False,
    column_format="lrr",
    na_rep="0",
    escape=False
)

tabular_only = tabular_only.replace("\\toprule", "\\hline\\hline") \
                           .replace("\\midrule", "\\hline") \
                           .replace("\\bottomrule", "\\hline\\hline")

latex_str_resized = (
    "\\begin{table}[htbp]\n"
    "\\centering\n"
    "\\caption{Token Counts by Cleaning Steps}\n"
    "\\label{tab:token_summary}\n"
    + tabular_only +
    "\n\\end{table}"
)

with open("Token_Summary.tex", "w", encoding="utf-8") as f:
    f.write(latex_str_resized)

summary_df

Unnamed: 0,Speech Type,Total Tokens,Total Unique Tokens
0,Raw Speech,32179438,196135
1,Preprocessed Speech,4500778,35009
2,Final Speech,4445174,9473
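
The loop above collects every token of the corpus in one list before counting. A lighter alternative, shown here as a sketch with a hypothetical helper `token_stats` and two made-up token lists, accumulates the counts in a `collections.Counter` so that only the vocabulary is kept in memory:

```python
from collections import Counter


def token_stats(token_lists):
    """Return (total tokens, unique tokens) for an iterable of token lists."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    return sum(counter.values()), len(counter)


# Two short, made-up speeches tokenized on whitespace.
speeches = [
    "We the peoples of the United Nations".split(),
    "the United Nations Charter".split(),
]
total, unique = token_stats(speeches)
print(total, unique)
```

The same helper could be applied to each of the three speech columns after converting the stored strings back to lists.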


### Speeches with the highest and lowest score

In [39]:
top5 = un_corpus_scored.nlargest(5, "score")[["country_name", "year", "score"]]
bottom5 = un_corpus_scored.nsmallest(5, "score")[["country_name", "year", "score"]]

top5.columns = ["Country", "Year", "Score"]
bottom5.columns = ["Country", "Year", "Score"]


# --- HTML ---
top_html = top5.to_html(index=False, classes="tb", border=0)
bottom_html = bottom5.to_html(index=False, classes="tb", border=0)

html_full = f"""
<div style="display:flex; justify-content:center; gap:40px; margin-top:20px;">

<div style="text-align:center;">
<h3>Top 5 Speeches by Score</h3>
<style>
.tb {{
  margin-left:auto; margin-right:auto;
  border-collapse: collapse;
}}
.tb th, .tb td {{
  padding:6px 12px; border:1px solid #ccc;
}}
.tb thead th {{
  background-color:#f2f2f2;
}}
</style>
{top_html}
</div>

<div style="text-align:center;">
<h3>Bottom 5 Speeches by Score</h3>
{bottom_html}
</div>

</div>
"""

html_full_with_note = html_full + """
<div style="text-align:center; margin-top:10px; font-size:0.9em; color:#555;">
Note: The two most and the two least emotional speeches are printed in full in the Appendix of the 3_tables file of the Replication Package.
</div>
"""

with open("TopBottom_SideBySide.html", "w", encoding="utf-8") as f:
    f.write(html_full_with_note)

# --- LaTeX ---
top_tex = top5.to_latex(
    index=False,
    float_format="%.3f",
    column_format="l r r",
    na_rep="",
    escape=False
).replace("\\toprule","\\hline\\hline") \
 .replace("\\midrule","\\hline") \
 .replace("\\bottomrule","\\hline\\hline")

bottom_tex = bottom5.to_latex(
    index=False,
    float_format="%.3f",
    column_format="l r r",
    na_rep="",
    escape=False
).replace("\\toprule","\\hline\\hline") \
 .replace("\\midrule","\\hline") \
 .replace("\\bottomrule","\\hline\\hline")

latex_side_by_side = f"""
\\begin{{table}}[ht]
\\centering
\\caption{{Top and Bottom 5 Speeches by Emotionality Score}}
\\label{{tab:top_bottom_side_by_side}}

\\begin{{minipage}}{{0.48\\textwidth}}
\\centering
\\textbf{{Top 5 Speeches}}\\\\[4pt]
{top_tex}
\\end{{minipage}}
\\hfill
\\begin{{minipage}}{{0.48\\textwidth}}
\\centering
\\textbf{{Bottom 5 Speeches}}\\\\[4pt]
{bottom_tex}
\\end{{minipage}}

\\vspace{{2mm}}
\\footnotesize{{Note: The two most and the two least emotional speeches are printed in full in the Appendix of the Script "3\\_tables" of the Replication Package.}}

\\end{{table}}
"""

with open("TopBottom_SideBySide.tex", "w", encoding="utf-8") as f:
    f.write(latex_side_by_side)


top5, bottom5

(                           Country  Year     Score
 6011  Democratic Republic of Congo  1999  1.659556
 6370                      Cameroon  2001  1.627488
 2184                      Honduras  1974  1.554834
 6363                        Bhutan  2001  1.548451
 2438                     Guatemala  1976  1.541758,
            Country  Year     Score
 6263       Moldova  2000  0.315310
 8797  Turkmenistan  2013  0.317456
 5521         Italy  1996  0.333378
 9966  Turkmenistan  2019  0.337649
 8604  Turkmenistan  2012  0.343467)

#### Speeches of permanent Security Council members with the highest and lowest scores

In [41]:
permanent_members = un_corpus_scored[un_corpus_scored['security_council_permanent'] == 1]

top5 = permanent_members.nlargest(5, 'score')[['country_name', 'year', 'score']]
bottom5 = permanent_members.nsmallest(5, 'score')[['country_name', 'year', 'score']]

top5 = top5.rename(columns={'country_name': 'Country', 'year': 'Year', 'score': 'Score'})
bottom5 = bottom5.rename(columns={'country_name': 'Country', 'year': 'Year', 'score': 'Score'})

# --- HTML ---

top5_html = top5.to_html(index=False, justify="center")
bottom5_html = bottom5.to_html(index=False, justify="center")

html_output = f"""
<div style="display:flex; gap:40px; justify-content:center;">

    <div>
        <h3>Top 5 Scores<br>(Security Council Permanent Members)</h3>
        {top5_html}
    </div>

    <div>
        <h3>Bottom 5 Scores<br>(Security Council Permanent Members)</h3>
        {bottom5_html}
    </div>

</div>
"""

with open("SC_top_bottom.html", "w", encoding="utf-8") as f:
    f.write(html_output)

# --- LaTeX ---

latex_top = top5.to_latex(
    index=False,
    column_format="lrr",
    escape=False
).replace("\\toprule", "\\hline\\hline") \
 .replace("\\midrule", "\\hline") \
 .replace("\\bottomrule", "\\hline\\hline")

latex_bottom = bottom5.to_latex(
    index=False,
    column_format="lrr",
    escape=False
).replace("\\toprule", "\\hline\\hline") \
 .replace("\\midrule", "\\hline") \
 .replace("\\bottomrule", "\\hline\\hline")

latex_combined = f"""
\\begin{{table}}[ht]
\\centering
\\caption{{Top 5 and Bottom 5 Speeches by Emotionality Score: Permanent Members of the Security Council}}
\\label{{tab:sc_top_bottom}}

\\begin{{minipage}}{{0.45\\textwidth}}
\\centering
\\textbf{{Top 5 Scores}}\\\\[3pt]
{latex_top}
\\end{{minipage}}
\\hfill
\\begin{{minipage}}{{0.45\\textwidth}}
\\centering
\\textbf{{Bottom 5 Scores}}\\\\[3pt]
{latex_bottom}
\\end{{minipage}}

\\end{{table}}
"""

with open("SC_top_bottom.tex", "w", encoding="utf-8") as f:
    f.write(latex_combined)

top5, bottom5

(             Country  Year     Score
 6514   United States  2001  1.490776
 5967   United States  1998  1.471603
 6891   United States  2003  1.449556
 8615   United States  2012  1.405341
 10940  United States  2024  1.360225,
      Country  Year     Score
 7312   China  2006  0.397298
 7618  Russia  2007  0.429954
 7126   China  2005  0.439656
 7426  Russia  2006  0.443977
 8194  Russia  2010  0.447654)
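
Most cells in this script repeat the same two steps: replacing the booktabs rules with `\hline` and wrapping the tabular in a `table` environment. One way to avoid the repetition is a small helper. The function `wrap_latex_table` below is hypothetical and not part of the script, and the hand-written tabular merely stands in for the string returned by `DataFrame.to_latex`:

```python
def wrap_latex_table(tabular, caption, label, placement="htbp"):
    """Swap booktabs rules for \\hline and wrap a tabular in a table float."""
    tabular = (
        tabular.replace("\\toprule", "\\hline\\hline")
        .replace("\\midrule", "\\hline")
        .replace("\\bottomrule", "\\hline\\hline")
    )
    return (
        f"\\begin{{table}}[{placement}]\n"
        "\\centering\n"
        f"\\caption{{{caption}}}\n"
        f"\\label{{{label}}}\n"
        f"{tabular.rstrip()}\n"
        "\\end{table}\n"
    )


# A hand-written tabular stands in for the output of DataFrame.to_latex.
demo_tabular = (
    "\\begin{tabular}{lr}\n\\toprule\nList & Length \\\\\n\\midrule\n"
    "Example & 3 \\\\\n\\bottomrule\n\\end{tabular}\n"
)
latex = wrap_latex_table(demo_tabular, "Example Table", "tab:example")
print(latex)
```

The side-by-side tables would still need their `minipage` layout, so the helper covers only the single-table case.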

#### Length of the Stopword and Procedural Word Lists

In [43]:
# Create summary DataFrame
stopwords_df = pd.DataFrame({
    "List": ["Stopwords", "Procedural Words"],
    "Length": [len(stopwords), len(procedural_words)]
})

# --- HTML ---
html_table = stopwords_df.to_html(index=False, border=1)
with open("Stopwordslist_Lengths.html", "w", encoding="utf-8") as f:
    f.write(html_table)

# --- LaTeX ---
tabular_only = stopwords_df.to_latex(
    index=False,
    column_format="l r",
    na_rep="",
    escape=False
)

# Replace default rules with hlines
tabular_only = tabular_only.replace("\\toprule", "\\hline\\hline") \
                           .replace("\\midrule", "\\hline") \
                           .replace("\\bottomrule", "\\hline\\hline")

# Wrap in table environment with centering
latex_str_resized = f"""
\\begin{{table}}[htbp]
\\centering
\\caption{{Lengths of Stopword Lists}}
\\label{{tab:wordlist_lengths}}
{tabular_only}
\\end{{table}}
"""

# Write to file
with open("Stopwordslist_Lengths.tex", "w", encoding="utf-8") as f:
    f.write(latex_str_resized)


stopwords_df

Unnamed: 0,List,Length
0,Stopwords,18997
1,Procedural Words,2884


#### Token Count of the dictionaries

In [67]:
summary = [
    ["Affect Dictionary", len(affect_dic)],
    ["Cognition Dictionary", len(cognition_dic)]
]

summary_df = pd.DataFrame(summary, columns=["Dictionary", "Number of Tokens"])

# --- HTML ---
html_table = summary_df.to_html(index=False)
with open("dictionary_token_summary.html", "w", encoding="utf-8") as f:
    f.write(html_table)

# --- LaTeX ---
tabular_only = summary_df.to_latex(
    index=False,
    column_format="lr",
    na_rep="0",
    escape=False
)

tabular_only = tabular_only.replace("\\toprule", "\\hline\\hline") \
                           .replace("\\midrule", "\\hline") \
                           .replace("\\bottomrule", "\\hline\\hline")

latex_str_resized = (
    "\\begin{table}[htbp]\n"
    "\\centering\n"
    "\\caption{Number of Tokens in Dictionaries}\n"
    "\\label{tab:dictionary_tokens}\n"
    + tabular_only +
    "\n\\end{table}"
)

with open("dictionary_token_summary.tex", "w", encoding="utf-8") as f:
    f.write(latex_str_resized)

summary_df

Unnamed: 0,Dictionary,Number of Tokens
0,Affect Dictionary,629
1,Cognition Dictionary,169


## Appendix

#### Yearly Emotionality Score

In [47]:
score_table = (
    un_corpus_scored
    .groupby('year')['score']
    .agg(['mean', 'count'])
    .reset_index()
    .rename(columns={'mean':'avg_score', 'count':'n'})
)

score_table['avg_score'] = score_table['avg_score'].round(3)

with pd.option_context('display.max_rows', None):
    display(score_table)

highest_year = score_table.loc[score_table['avg_score'].idxmax()]
lowest_year = score_table.loc[score_table['avg_score'].idxmin()]

print(f"Year with the highest average score: {highest_year['avg_score']} in {int(highest_year['year'])}")
print(f"Year with the lowest average score: {lowest_year['avg_score']} in {int(lowest_year['year'])}")

Unnamed: 0,year,avg_score,n
0,1946,0.861,39
1,1947,0.868,39
2,1948,0.85,39
3,1949,0.819,35
4,1950,0.881,44
5,1951,0.939,51
6,1952,0.861,43
7,1953,0.843,44
8,1954,0.807,42
9,1955,0.747,45


Year with the highest average score: 0.939 in 1951
Year with the lowest average score: 0.656 in 1997


#### Years with over 0.08 change in the emotionality score

In [49]:
# Compute year-over-year difference
score_table['diff'] = score_table['avg_score'].diff()

# Find years where increase >= 0.08
increased_years = score_table[score_table['diff'] >= 0.08]

# Find years where decrease <= -0.08
decreased_years = score_table[score_table['diff'] <= -0.08]

# Display increases
print("Years with an increase of >= 0.08:")
with pd.option_context('display.max_rows', None):
    display(increased_years[['year', 'avg_score', 'diff']])

# Display decreases
print("Years with a decrease of >= 0.08:")
with pd.option_context('display.max_rows', None):
    display(decreased_years[['year', 'avg_score', 'diff']])


Years with an increase of >= 0.08:


Unnamed: 0,year,avg_score,diff
10,1956,0.903,0.156
14,1960,0.894,0.087
52,1998,0.756,0.1
55,2001,0.899,0.192
68,2014,0.882,0.084


Years with a decrease of >= 0.08:


Unnamed: 0,year,avg_score,diff
24,1970,0.828,-0.093
29,1975,0.779,-0.091
56,2002,0.818,-0.081
59,2005,0.709,-0.09
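
Note that `diff` is computed on `avg_score` after it has been rounded to three decimals, so a change close to the 0.08 cut-off can be shifted onto the threshold by rounding alone. The sketch below uses made-up yearly means (not corpus results) to compare the difference of the exact means with the difference of the rounded means:

```python
import pandas as pd

# Toy yearly means (illustrative values, not corpus results).
yearly = pd.DataFrame({
    "year": [2000, 2001, 2002],
    "avg_score": [0.7004, 0.7800, 0.8596],
})
threshold = 0.08

# Difference of the exact means versus difference of the rounded means.
yearly["diff_exact"] = yearly["avg_score"].diff()
yearly["diff_rounded"] = yearly["avg_score"].round(3).diff()

# Flag large changes based on the exact differences.
yearly["flag_exact"] = yearly["diff_exact"].abs() >= threshold
print(yearly)
```

In this toy case the exact change from 2000 to 2001 is about 0.0796 and stays below the threshold, while the rounded change sits at roughly 0.080, where the comparison depends on floating-point representation. Rounding only for display would sidestep the ambiguity.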


#### Speeches with the highest and lowest score

In [51]:
# Closer look at the two most emotional speeches
def print_speech(country, year):
    speech_row = un_corpus_scored[
        (un_corpus_scored['country_name'] == country) &
        (un_corpus_scored['year'] == year)
    ]
    if not speech_row.empty:
        print(f"Speech from {country} in {year}:\n")
        print(speech_row.iloc[0]['speech'])
        print("\n" + "="*50 + "\n")
    else:
        print(f"No speech found for {country} in {year}.\n")

print_speech('Cameroon', 2001)

print_speech('Democratic Republic of Congo', 1999)

# Topic Cameroon:
# - terrorist attacks, Taliban, Afghanistan
# - condolences to 9/11 victims
# - unite forces to fight terrorism
# - Nobel Peace Prize
# - mentions other conflicts/wars: Angola, Palestine, Dem. Rep. Congo
# - "demons", "profound", "sadness", "dismay", "terrible", "mourning", "urge", "brutal"


# Topic Dem. Rep. of Congo:
# - congratulates the newly elected President of the General Assembly
# - quotes the UN Charter and criticizes double standards
# - criticizes members that violate the UN Charter
# - "Blitzkrieg" invasion of their country by Burundi, Rwanda and Uganda
# - criticizes exploitation of diamonds, cobalt, copper, and gold
# - urges peaceful resolution and national reconstruction
# - "happy", "warmest", "love", "provocation", "violation", "torture", "attacked"

Speech from Cameroon in 2001:

﻿I should like at the outset to express the profound sympathy and condolences of Cameroon to the Government and the people of the United States of America and of the Dominican Republic for the accident involving the American Airlines Airbus on 12 November in New York. It was also with great dismay and sadness that we learned of the natural disaster that struck the brotherly people of Algeria with such severity. On behalf of the people and the Government of Cameroon, I would like to extend to that country our most profound condolences and solidarity. Rarely has a session of the General Assembly been such a focus of international public attention or aroused the interest of the worldwide media to the extent that the current session has. The annual session of the General Assembly is a powerful symbol of the coming together of nations, the promotion of cultures and respect for differences and freedoms. This year, however, a shadow has been cast over our sessio

In [52]:
# Closer look at the two most rational speeches
# (reuses print_speech defined in the previous cell)

print_speech('Moldova', 2000)

print_speech('Turkmenistan', 2013)

# Topic Moldova:

# Topic Turkmenistan:

Speech from Moldova in 2000:

Allow me at the outset, Mr. President, to convey to you our cordial congratulations and regards on your election as President of the fifty fifth session of the General Assembly. I am convinced that your competence and active cooperation with United Nations Member States will pave the way for a successful and fruitful session. I would also like to express sincere thanks to your predecessor, Mr. Theo Ben Gurirab, Minister for Foreign Affairs of Namibia, for the effective manner in which he guided the work of the previous session. At the same time, I wish to commend Mr. Kofi Annan, the Secretary General, for his firm leadership since taking office and for his visionary and actionoriented report “We the peoples:”. This report, together with the forward looking Millennium Declaration, adopted two weeks ago at the historic Millennium Summit, sets an ambitious agenda for the United Nations for the twenty first century. I would also like to convey our warmest welc