# National Constituent Assembly speech data

Curated for the PNAS article [Individuals, institutions, and innovation in the debates of the French Revolution](https://www.pnas.org/content/115/18/4607.short) by Barron, Spang, Huang, and DeDeo.

## Preliminaries

In [1]:
import pandas as pd
import numpy as np
import os
import gzip

## Loading data

### Load speech data flatfile

* To accommodate the wide variety of characters in the speech text itself, this flatfile uses '=+=' as its column delimiter, with newline-delimited rows ('\n').
* The alternative would be to strip a particular character from all text in the data and use that as the delimiter; I opt to preserve the raw source data.
* A convenience function to read the table into a pandas DataFrame is included below, since the pandas built-in readers don't handle this format gracefully.

In [13]:
def load_FRevNCA_speechdata(fpath):
    """
    Convenience function to load curated French Revolution NCA data.
    
    """

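    def _parse_bool(s):
        # bool('False') is True, so test the text itself. This assumes the
        # file stores booleans as the strings 'True' and 'False'.
        return s == 'True'

    # One converter per column, in file order. (np.bool was removed in NumPy 1.24.)
    dtypes = [np.int64, str, str, np.int64, str, np.int64] + [str]*6 + [_parse_bool] + [str]*5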

    # Parse file.
    with gzip.open(fpath, mode='rt', encoding='utf-8', compresslevel=9) as f:
        _rows = [line.strip().split('=+=') for line in f.readlines()]

    # Apply data types.
    _rows_data = [[x(y) for x, y in zip(dtypes, _row)] for _row in _rows[1:]]

    return pd.DataFrame(_rows_data, columns=_rows[0])
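
To make the file layout concrete, here is a minimal sketch that writes a tiny '=+='-delimited gzip file and reads it back by splitting each line. The file name `toy_speechdata.txt.gz`, the three-column layout, and the second row are hypothetical and only illustrate the format; this is not the script that produced the real flatfile.

```python
import gzip
import os
import tempfile

DELIM = '=+='  # column delimiter described above

# Hypothetical toy table. The second text field contains a comma, quotes,
# and a tab, none of which collide with the delimiter.
header = ['NCASpeechId', 'SpeakerStr', 'RawTextFr']
rows = [
    ['142', 'M de Lally-Tollendal', 'donne lecture du procès-verbal.'],
    ['143', 'M le Président', 'Oui, "non"\t; etc.'],
]

with tempfile.TemporaryDirectory() as tmpdir:
    toy_fpath = os.path.join(tmpdir, 'toy_speechdata.txt.gz')

    # Write one row per line, joining fields with the delimiter.
    with gzip.open(toy_fpath, mode='wt', encoding='utf-8') as f:
        for row in [header] + rows:
            f.write(DELIM.join(row) + '\n')

    # Read it back by splitting each line on the delimiter.
    with gzip.open(toy_fpath, mode='rt', encoding='utf-8') as f:
        parsed = [line.rstrip('\n').split(DELIM) for line in f]

parsed_header, parsed_rows = parsed[0], parsed[1:]
print(parsed_header)
print(parsed_rows)
```

Because the text never contains '=+=', the fields survive the round trip unchanged without any quoting or escaping.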

In [15]:
# Load speech data.

flatfile_fpath = 'FRevNCA_speechdata.txt.gz'

FRevNCA_data_df = load_FRevNCA_speechdata(flatfile_fpath)

FRevNCA_data_df.shape

(44953, 18)

In [16]:
FRevNCA_data_df.head()

| | NCASpeechId | Date | OrigFile | Volume | PbTagId | PageNum | SpeakerStr | Surname | Name | Affiliation | Estate | Club | President | CommitteeStatus | RawTextFr | RawTextEnTrans | ProcessedText | ProcessedVocabText |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 142 | 1789-07-09 | bm916nx5550.xml | 8 | bm916nx5550_00_0280 | 211 | M de Lally-Tollendal | Lally-Tollendal | Gérard | nonpos | 2 | nonpos | True | noncomm | donne lecture du procès-verbal. | reads the minutes. | donne lecture du procèsverbal | donne lecture procèsverbal |
| 1 | 143 | 1789-07-09 | bm916nx5550.xml | 8 | bm916nx5550_00_0280 | 211 | M le Président | Lefranc de Pompignan | Jean-Georges | nonpos | 1 | nonpos | True | noncomm | prévient l'Assemblée que M. le rapporteur de l... | informed the Assembly that the Rapporteur of t... | prévient l assemblée que m le rapporteur de la... | prévient assemblée rapporteur députation baill... |
| 2 | 144 | 1789-07-09 | bm916nx5550.xml | 8 | bm916nx5550_00_0280 | 211 | M Tronchet | Tronchet | François Denis | g | 3 | nonpos | True | noncomm | fait ce rapport. Il en résulte qu'il existe de... | made this report. It follows that there are tw... | fait ce rapport il en résulte qu il existe deu... | fait rapport résulte existe deux députations n... |
| 3 | 145 | 1789-07-09 | bm916nx5550.xml | 8 | bm916nx5550_00_0281 | 212 | M Le Pelletier de Saint-Fargeau | Le Peletier de Saint-Fargeau | Louis-Michel | g | 2 | j | True | noncomm | Je crois qu'il faut plutôt les renvoyer toutes... | I think we should rather send them both than a... | je crois qu il faut plutôt les renvoyer toutes... | crois plutôt renvoyer toutes deux admettre exc... |
| 4 | 146 | 1789-07-09 | bm916nx5550.xml | 8 | bm916nx5550_00_0281 | 212 | M le Président | Lefranc de Pompignan | Jean-Georges | nonpos | 1 | nonpos | True | noncomm | . Je demande s'il ne convient pas d'abord de s... | . I ask whether it is not appropriate first to... | je demande s il ne convient pas d abord de sta... | demande convient abord statuer première députa... |


#### Column guide:

* `NCASpeechId`: universal speech index used for all data.
* `Date`: date of the speech.  These were cleaned and corrected from the original, which had errors in order and in formatting.
* `OrigFile`: original xml file.
* `Volume`: original volume of the Archives Parlementaires (AP).
* `PbTagId`: location id used throughout the original xml, useful for old FRDA web interface or working with original xml files.  The speech falls after this PbTagId and before the next, in AP page order.
* `PageNum`: page of the AP on which the speech occurs.
* `SpeakerStr`: speaker string provided by the FRDA xml.
* `Surname` and `Name`: identities disambiguated from all the SpeakerStrs.  These are the ones used in the PNAS analysis.  Note: although a lot of manual attention produced these attributions, they are not guaranteed 100% accurate!  There was significant noise in the SpeakerStr data - see the [Supplementary Material](https://www.pnas.org/content/suppl/2018/04/16/1717729115.DCSupplemental), _Preparing and characterizing speech data_ section, for more detail. "nomatch" indicates the speech's `SpeakerStr` was not assigned to a disambiguated entity.
* `Affiliation`: "g" (gauche), "d" (droite), "nonpos" (matched identity isn't positively identified as gauche or droite according to our historian co-author), or "nomatch" (no identity match was made to `SpeakerStr`).
* `Estate`: 1st/2nd/3rd estate, or "nonpos"/"nomatch" as for `Affiliation`.
* `Club`: an assortment of political clubs to which individuals belonged, or "nonpos"/"nomatch".
* `President`: binary presidential speech indicator.
* `CommitteeStatus`: "newitem" (speaker as committee proxy introduces a decree proposal to the floor), "indebate" (committee proxy speaks in the midst of debate), or "noncomm" (speaker is not a committee proxy).
* `RawTextFr`: The raw speech text obtained from the original xml.
* `RawTextEnTrans`: For giggles, I wrote a script circa 2016 that ran all of the raw speeches through Google Translate.  The results are included here.
* `ProcessedText`: `RawTextFr` after light tokenization.
* `ProcessedVocabText`: `ProcessedText` after removing words with fewer than 3 characters, stop words, then limiting to a 10,000-word vocabulary by highest observed frequency.
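
The filtering described for `ProcessedVocabText` can be sketched roughly as follows. This is an illustration under stated assumptions, not the original preprocessing code: the function name `build_vocab_text` and the stop-word list `toy_stop_words` are hypothetical, and the stop-word list actually used is not given here.

```python
from collections import Counter


def build_vocab_text(processed_texts, stop_words, min_chars=3, vocab_size=10000):
    """Drop short words and stop words, then keep only the most frequent words."""
    # Step 1: remove words with fewer than `min_chars` characters and stop words.
    filtered = [
        [w for w in text.split() if len(w) >= min_chars and w not in stop_words]
        for text in processed_texts
    ]
    # Step 2: limit the vocabulary to the `vocab_size` most frequent words.
    counts = Counter(w for words in filtered for w in words)
    vocab = {w for w, _ in counts.most_common(vocab_size)}
    return [' '.join(w for w in words if w in vocab) for words in filtered]


# Hypothetical stop-word list, for illustration only.
toy_stop_words = {'les', 'que', 'des'}
toy_texts = ['donne lecture du procèsverbal', 'fait ce rapport il en résulte']
toy_vocab_texts = build_vocab_text(toy_texts, toy_stop_words)
print(toy_vocab_texts)
```

On these two toy inputs the length filter alone removes "du", "ce", "il", and "en", which is consistent with the `ProcessedVocabText` values shown in the table above.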

### Loading topics, speech topic mixtures, and vocabulary basis

The topic mixtures used in the paper were trained from `ProcessedVocabText`.  These, along with their associated topics and the vocabulary basis, are included here.  The vocabulary contains 10,000 words, and there are 100 topics.

In [17]:
# Load topics.

topicfile_fpath = "FRevNCA_ProcessedVocabText_topics.gz"
FRevNCA_topics_arr = np.loadtxt(topicfile_fpath)

FRevNCA_topics_arr.shape

(100, 10000)

In [18]:
# Load topic mixtures.

topicmixfile_fpath = "FRevNCA_ProcessedVocabText_topicmixtures.gz"
FRevNCA_topicmixtures_arr = np.loadtxt(topicmixfile_fpath)

FRevNCA_topicmixtures_arr.shape

(44953, 100)
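
Since the topic mixture array has one row per speech, in the same order as the speech DataFrame, the two can be combined by row position. Below is a minimal sketch that averages topic mixtures per `Affiliation`; the arrays `toy_df` and `toy_mix` are hypothetical stand-ins for `FRevNCA_data_df` and `FRevNCA_topicmixtures_arr`, and speeches without a mixture (NaN rows, see the Q and A section) are dropped first.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: 5 speeches, 3 topics (the real arrays are 44953 x 100).
toy_df = pd.DataFrame({'Affiliation': ['g', 'd', 'g', 'nomatch', 'd']})
toy_mix = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.5, 0.3, 0.2],
    [np.nan, np.nan, np.nan],  # a speech with no topic mixture
    [0.3, 0.6, 0.1],
])

# Rows of the mixture array are assumed to align with rows of the DataFrame.
mix_df = pd.DataFrame(toy_mix, index=toy_df.index)

# Drop speeches without a mixture, then average per affiliation.
valid = mix_df.notna().all(axis=1)
mean_mix = mix_df[valid].groupby(toy_df.loc[valid, 'Affiliation']).mean()
print(mean_mix)
```

The same pattern should work for any categorical column in the speech data, such as `Estate` or `CommitteeStatus`.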

In [19]:
# Load vocabulary basis.

vocabbasis_fpath = 'FRevNCA_ProcessedVocabText_vocabbasis.txt.gz'

with gzip.open(vocabbasis_fpath, mode='rt', encoding='utf-8', compresslevel=9) as f:
    d_ind_w = dict(map(lambda x: (int(x[0]), x[1]),
                       [line.strip().split(' ') for line in f]))

list(d_ind_w.items())[:10]

[(0, 'abandon'),
 (1, 'abandonnant'),
 (2, 'abandonne'),
 (3, 'abandonner'),
 (4, 'abandonné'),
 (5, 'abandonnée'),
 (6, 'abandonnées'),
 (7, 'abandonnés'),
 (8, 'abattre'),
 (9, 'abbaye')]

The integer paired with each word above is that word's column index in each topic vector.  So, you can display the top 10 words for a range of topics like so:

In [20]:
# Create DataFrame tables of top words in each topic.

topnum = 10 # Number of top words to display.
chunknum = 6 # Number of topics to group into one table.

topicnum = FRevNCA_topics_arr.shape[0]

# Iterate over groups, or "chunks", of topics and
# create DataFrames for each chunk.
topword_dfs = []
for e in range(chunknum, topicnum+chunknum, chunknum):
    topics_chunk = FRevNCA_topics_arr[e-chunknum:e, :]
    
    # Collect the top words for each topic in this chunk.
    topwordvecs = []
    for topic in topics_chunk:
        top_word_idcs = np.argsort(topic)[-topnum:]
        topwords = [d_ind_w[k] for k in top_word_idcs]
        topwordvecs.append(topwords)

    # Combine the top words into a single DataFrame.
    topwordarr = np.vstack(topwordvecs).T
    names = ['Topic {}'.format(k) for k in range(e-chunknum, e)[:topwordarr.shape[1]]]
    topword_df = pd.DataFrame(topwordarr, columns=names)
    topword_dfs.append(topword_df)
    
# Display the first chunk's table.
topword_dfs[0]

| | Topic 0 | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
|---|---|---|---|---|---|---|
| 0 | demande | marseille | nîmes | propriétés | compte | procédure |
| 1 | voix | affaire | étaient | culte | rendre | juge |
| 2 | droite | décret | plus | été | exécution | juges |
| 3 | oui | assemblée | municipaux | propriété | moyens | accusation |
| 4 | plusieurs | été | garde | bénéfices | plus | jugement |
| 5 | non | comité | soldats | dîmes | prendre | fait |
| 6 | membres | nationale | municipalité | nation | commissaires | témoins |
| 7 | applaudissements | procédure | ville | ecclésiastiques | nationale | juré |
| 8 | gauche | tribunal | officiers | clergé | mesures | accusé |
| 9 | murmures | contre | régiment | biens | assemblée | jurés |
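
Note that `np.argsort` sorts in ascending order, so the cell above lists each topic's words from lowest to highest weight among the top 10. If you prefer the highest-weight word first, a small helper along these lines may be handy. The name `top_words` and the toy vocabulary are hypothetical; with the real data you would pass a row of `FRevNCA_topics_arr` and `d_ind_w`.

```python
import numpy as np


def top_words(topic, index_to_word, topnum=10):
    """Return the `topnum` highest-weight words of one topic, best first."""
    # Reverse the ascending argsort so the largest weights come first.
    top_idcs = np.argsort(topic)[::-1][:topnum]
    return [index_to_word[int(k)] for k in top_idcs]


# Hypothetical toy topic over a 5-word vocabulary.
toy_vocab = {0: 'abandon', 1: 'abbaye', 2: 'clergé', 3: 'juge', 4: 'ville'}
toy_topic = np.array([0.05, 0.10, 0.50, 0.30, 0.05])
toy_top = top_words(toy_topic, toy_vocab, topnum=3)
print(toy_top)
```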


## Q and A

Q: Why does `NCASpeechId` begin at 142?

A: The Archives Parlementaires contains speeches from before the National Constituent Assembly (NCA) was officially established. When parsing the original xml, `NCASpeechId` was assigned to all speeches in chronological order, but the NCA did not come into being until the 9th of July, 1789, and `NCASpeechId` 142 was the first speech on that day!

Q: Why are there NaN rows in the topic mixture array?

A: The topic mixtures were trained from the strings held in the `ProcessedVocabText` column.  If we show the speech data corresponding to those topic mixture NaN rows, we see that `ProcessedVocabText` was empty for these speeches:

In [34]:
# Show the speech data corresponding to NaN rows of the topic mixture array.

# Get the indices of NaN topic mixture rows.
topmix_nan_rowidcs = np.unique(np.where(np.isnan(FRevNCA_topicmixtures_arr))[0])

# Display corresponding cross-section of speech data.
novocab_data_df = FRevNCA_data_df.iloc[topmix_nan_rowidcs]
novocab_data_df[['NCASpeechId', 'Date',
                 'Surname', 'Name',
                 'RawTextFr', 'RawTextEnTrans', 'ProcessedText', 'ProcessedVocabText']]

| | NCASpeechId | Date | Surname | Name | RawTextFr | RawTextEnTrans | ProcessedText | ProcessedVocabText |
|---|---|---|---|---|---|---|---|---|
| 1080 | 1222 | 1789-08-29 | nomatch | nomatch | et | and | et | |
| 1243 | 1385 | 1789-09-11 | Target | Guy Jean Baptiste | réfute cette allégation. | refutes this allegation. | réfute cette allégation | |
| 13440 | 13582 | 1790-07-08 | Repoux | Jean Marie | les combàt. | the combat. | les combàt | |
| 16189 | 16331 | 1790-08-21 | Dinochau | Jacques Samuel | \| fi'^é^M^craï^ II'r ett'tf(Kji)tei, ' i^^n'èî... | \| (Kji) tiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii... | fi émcraï ii r ett tfkjitei in èîkoii f jë wi ... | |
| 16780 | 16922 | 1790-08-31 | Roederer | Pierre-Louis | lif; prpjet de prgçljijnatjpn. | lif; project of prg aljinat jpn. | lif prpjet de prgçljijnatjpn | |
| 19254 | 19396 | 1790-10-28 | nomatch | nomatch | Nous nous y opposons, nous tous Alsaciens. | We oppose it, all of us Alsatians. | nous nous y opposons nous tous alsaciens | |
| 19638 | 19780 | 1790-11-06 | nomatch | nomatch | Vous nous insultez ! | You insult us! | vous nous insultez | |
| 20135 | 20277 | 1790-11-16 | nomatch | nomatch | Faites-nous donc Un raisbfinement suivit | So give us a refresher followed | faitesnous donc un raisbfinement suivit | |
| 22184 | 22326 | 1791-01-04 | Maury | Jean-Sifrein | Frappez, mais écoutez! | Strike, but listen! | frappez mais écoutez | |
| 22332 | 22474 | 1791-01-06 | nomatch | nomatch | ....Prouvez J prouvez l | .... Prove J prove l | prouvez j prouvez l | |


My personal favorites among these 40 speeches (only the first 10 rows are reproduced above):

* "It is so that you do not intrigue."
* "You slander yourself by putting it that way."

Moreover, when we compare the NaN topic mixture indices against the indices of the speeches with empty `ProcessedVocabText` selected above, we see they are the same:

In [26]:
novocab_data_df.index.to_numpy()

array([ 1080,  1243, 13440, 16189, 16780, 19254, 19638, 20135, 22184,
       22332, 22379, 22922, 25970, 25979, 26511, 26625, 27386, 28114,
       28180, 30706, 31043, 32256, 33551, 34065, 34205, 34321, 35616,
       35900, 36281, 36573, 38029, 38730, 40662, 41890, 41897, 42062,
       42674, 42750, 43027, 44500])

In [27]:
topmix_nan_rowidcs

array([ 1080,  1243, 13440, 16189, 16780, 19254, 19638, 20135, 22184,
       22332, 22379, 22922, 25970, 25979, 26511, 26625, 27386, 28114,
       28180, 30706, 31043, 32256, 33551, 34065, 34205, 34321, 35616,
       35900, 36281, 36573, 38029, 38730, 40662, 41890, 41897, 42062,
       42674, 42750, 43027, 44500])

In [30]:
np.all(topmix_nan_rowidcs == novocab_data_df.index.to_numpy())

True
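
One caveat: `novocab_data_df` was itself selected with `topmix_nan_rowidcs`, so the comparison above matches by construction. A more independent check would derive the empty-text rows from the `ProcessedVocabText` column alone and then compare. The sketch below shows the idea on hypothetical toy data (`toy_df` and `toy_mix` stand in for `FRevNCA_data_df` and `FRevNCA_topicmixtures_arr`); it has not been run against the real files here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the speech DataFrame and the topic mixture array.
toy_df = pd.DataFrame({'ProcessedVocabText': ['donne lecture', '', 'fait rapport', '']})
toy_mix = np.array([[0.6, 0.4], [np.nan, np.nan], [0.2, 0.8], [np.nan, np.nan]])

# Rows whose topic mixture contains NaN.
nan_rows = np.flatnonzero(np.isnan(toy_mix).any(axis=1))

# Rows whose vocabulary-filtered text is missing or empty, computed from the
# text column alone.
text = toy_df['ProcessedVocabText']
empty_rows = np.flatnonzero((text.isna() | (text.str.strip() == '')).to_numpy())

rows_match = np.array_equal(nan_rows, empty_rows)
print(rows_match)
```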