This notebook computes the descriptive statistics of the metadata and human annotations.

* Input: 'AgoraSpeech_preprocessed.csv'
* Output: statistics printed to the notebook's cell output
* Actions: For the entire dataset and per individual politician, it computes:
    1. Total number of speeches
    2. Total number of paragraphs
    3. Total number of words
    4. Average paragraphs per speech
    5. Average words per speech
    6. Average words per paragraph
    7. Percentage of paragraphs identified as 'criticism'
    8. Percentage of paragraphs identified as 'political agenda'
    9. Number of unique topics across all speeches
    10. Average number of unique topics per speech
    11. Average sentiment score of all speeches
    12. Average sentiment score per speech
    13. Average polarization score of all speeches
    14. Average polarization score per speech
    15. Average populism score of all speeches
    16. Average populism score per speech
    17. Number of unique entities across all speeches
    18. Average number of unique entities per speech
    19. Average number of unique entities per paragraph
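
The first six statistics in the list above are plain counts and ratios. A minimal sketch of how they could be derived with pandas follows. The helper module used by this notebook is not shown here, so the function name `general_statistics`, the column names `speech_id` and `text`, the one-row-per-paragraph layout, and the whitespace word count are all assumptions made for illustration.

```python
import pandas as pd


def general_statistics(df: pd.DataFrame) -> dict:
    """Count speeches, paragraphs, and words, then derive the averages.

    Assumes one row per paragraph with the hypothetical columns
    'speech_id' and 'text'.
    """
    n_speeches = df["speech_id"].nunique()
    n_paragraphs = len(df)
    # Whitespace tokenization; the actual notebook may count words differently.
    n_words = int(df["text"].str.split().str.len().sum())
    return {
        "speeches": n_speeches,
        "paragraphs": n_paragraphs,
        "words": n_words,
        "avg_paragraphs_per_speech": round(n_paragraphs / n_speeches, 2),
        "avg_words_per_speech": round(n_words / n_speeches, 2),
        "avg_words_per_paragraph": round(n_words / n_paragraphs, 2),
    }


# Toy data: two speeches, three paragraphs in total.
toy = pd.DataFrame({
    "speech_id": [1, 1, 2],
    "text": ["we will build schools", "they failed", "jobs for all"],
})
stats = general_statistics(toy)
print(stats)
```

The same ratios can be checked against the printed output for the full dataset: 5279 paragraphs over 171 speeches gives 30.87 paragraphs per speech.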

In [1]:
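import warnings
import pandas as pd

# silence warnings so the printed statistics stay readable
warnings.filterwarnings('ignore')
# project helper module; provides all_statistics used below
from descriptive_statistics_functions import *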

In [2]:
# read the preprocessed data
data = pd.read_csv('AgoraSpeech_preprocessed.csv')

Descriptive statistics for the entire dataset

In [3]:
print("Statistics for the overall dataset ... ")
all_statistics(data, overall=True)

Statistics for the overall dataset ... 


General statistics:
-------------------
total number of speeches:  171
total number of paragraphs:  5279
total number of words:  717718
avg paragraphs per speech:  30.87
avg words per speech  4197.18
avg words per paragraph:  135.96
Criticism vs agenda percentages:
--------------------------------
agenda percentage:  0.6098
criticism percentage:  0.3902
Topics statistics:
------------------
sum_of_unique_topics_of_all_speeches:  33
avg_of_unique_topics_of_per_speech:  12.53
Sentiment, polarization and populsim statistics:
------------------------------------------------
sentiment_avg:  0.03
sentiment_avg_per_speech:  0.01
polarization_avg:  0.16
polarization_avg_per_speech:  0.16
populism_avg:  0.07
populism_avg_per_speech:  0.07
Entities statistics:
--------------------
sum of unique entities of all speeches:  7763
avg of unique entities per speech:  122.61
avg of unique entities per paragraph:  5.66
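
The `*_avg` and `*_avg_per_speech` figures above can differ (0.03 versus 0.01 for sentiment). One plausible reading is that the first pools all paragraphs into a single mean, while the second averages within each speech and then averages those speech-level means. This is an assumption, since the helper functions are not shown. The toy sketch below uses hypothetical column names `speech_id` and `sentiment` to show how the two can diverge when speeches have different lengths.

```python
import pandas as pd

# Toy data: speech 1 has three paragraphs, speech 2 has one.
toy = pd.DataFrame({
    "speech_id": [1, 1, 1, 2],
    "sentiment": [1.0, 1.0, 1.0, -1.0],
})

# Pooled average: every paragraph carries equal weight.
pooled_avg = toy["sentiment"].mean()

# Per-speech average: every speech carries equal weight,
# regardless of how many paragraphs it contains.
per_speech_avg = toy.groupby("speech_id")["sentiment"].mean().mean()

print(pooled_avg, per_speech_avg)
```

Under this reading, long speeches dominate the pooled figure, whereas the per-speech figure treats each speech as one observation.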


Descriptive statistics per individual politician

In [4]:
for politician in data['politician'].unique():
    politician_df = data[data['politician'] == politician]
    print(f"Statistics for {politician} ... ")
    all_statistics(politician_df)
    print("\n")

Statistics for Androulakis ... 
General statistics:
-------------------
total number of speeches:  26
total number of paragraphs:  883
total number of words:  132620
avg paragraphs per speech:  33.96
avg words per speech  5100.77
avg words per paragraph:  150.19
Criticism vs agenda percentages:
--------------------------------
agenda percentage:  0.6025
criticism percentage:  0.3975
Topics statistics:
------------------
sum_of_unique_topics_of_all_speeches:  32
avg_of_unique_topics_of_per_speech:  15.58
Sentiment, polarization and populsim statistics:
------------------------------------------------
sentiment_avg:  0.0
sentiment_avg_per_speech:  -0.01
polarization_avg:  0.17
polarization_avg_per_speech:  0.16
populism_avg:  0.09
populism_avg_per_speech:  0.09
Entities statistics:
--------------------
sum of unique entities of all speeches:  1605
avg of unique entities per speech:  137.92
avg of unique entities per paragraph:  6.51


Statistics for Koutsoumpas ... 
General statistics:
-