# Data Curation
Jana Bruses | janabruses@pitt.edu | University of Pittsburgh | February 23, 2025

## **1. A few data considerations**
Since we are examining language variation, it is worth recalling what Montoya Abat and Mas i Miralles note in Linguistic Variation in the Governació d'Oriola: "written language always represents a later stage than the actual appearance of any linguistic phenomenon, since the written text incorporates innovations when in oral expression they have already been occurring for some time." They further add that in written language, "the appearance of occurrences tends to be considerably reduced." This is because standardization tends to exert a stronger influence on written language, particularly in literary works, which go through multiple reviews before publication.

In this regard, spoken data would likely provide the best approach for identifying traces of Catalan’s substitution. However, as we need our data to span a sufficiently long timeframe to capture changes and differences, finding open spoken resources from earlier periods is highly challenging. Therefore, we will work with the closest available approximation to spoken data: speeches, plenary sessions, and other written records derived from spoken works.

We were not able to find a single Catalan corpus covering a timespan of more than 15 years, so we will use multiple corpora that fulfill the points mentioned above.

In chronological order these are:
1) **CTILC (1832-1985)**\
Corpus originally containing texts published between 1832 and 1988, later expanded to include recent works (after 2015). It was created for the development of the descriptive dictionary of the Catalan language, known as DDLC. Part of the corpus has been made available for public use: only works that are no longer subject to copyright in Spain are being released, work by work, as single text files.\
The downloadable corpus consists of 337 files of literary works and 596 non-literary texts, all published before 1985.\
Of these works, we will use 28 texts written to be delivered orally, specifically speeches.

2) **Radioteca.cat**\
Library of over 300,000 AI-transcribed radio programs and summaries.\
It would need to be web-scraped, so we need to ask for permission first.

3) **Parlament Parla (2007-2018)**\
Speech corpus by Col·lectivaT containing Catalan Parliament (Parlament de Catalunya) plenary sessions from 2007 to 2018.\
Transcriptions have been aligned with the recordings and the corpus extracted.\
The corpus comprises 211 hours of clean and 400 hours of other-quality segments, where each speech segment is tagged with its speaker and the speaker's gender.

4) **ParlaMint-ES-CT (2015-2022)**\
ParlaMint-ES-CT is the Spanish and Catalan parliamentary corpus covering 2015 to 2022, part of the corpus project ParlaMint: Comparable Parliamentary Corpora. The full project contains compiled subcorpora from 29 countries and autonomous regions, in the original languages as well as machine-translated into English.

***If Radioteca.cat does not work:***\
Unfortunately, we could not identify a publicly available corpus covering the period between 1985 and 2007 that meets our criterion of being as close to spoken data as possible. This gap poses a challenge, as linguistic changes occurring during these two decades may be underrepresented in our analysis. However, we mitigate this by focusing on corpora that maintain a consistent genre (spoken or speech-derived texts), ensuring comparability across different time periods.
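
The size of the gap follows from the date ranges listed above. A minimal sketch of that arithmetic, with the ranges hard-coded from the list (the `coverage_gaps` helper is hypothetical, and Radioteca.cat is left out because no date range is stated for it here):

```python
def coverage_gaps(periods):
    """Return (start, end) year pairs that no period covers."""
    gaps = []
    covered_until = None
    for start, end in sorted(periods):
        # a gap opens when the next corpus starts after everything seen so far has ended
        if covered_until is not None and start > covered_until:
            gaps.append((covered_until, start))
        covered_until = end if covered_until is None else max(covered_until, end)
    return gaps

# year ranges copied from the corpus list above
corpora = {
    "CTILC": (1832, 1985),
    "Parlament Parla": (2007, 2018),
    "ParlaMint-ES-CT": (2015, 2022),
}

gaps = coverage_gaps(corpora.values())
for start, end in gaps:
    print(f"uncovered: {start}-{end} ({end - start} years)")
```

Overlapping ranges, such as the two parliamentary corpora, are merged rather than reported as gaps.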

## **2. Data processing**

The data processing procedure for each of the corpora follows, in chronological order.\
2.1 CTILC\
2.2 Radioteca.cat\
2.3 Parlament Parla\
2.4 ParlaMint-ES-CT\
2.5 Combined

### 2.1 CTILC

The files are .txt, so we will parse them using NLTK's plain text corpus reader.\
Their encoding is UTF-8 with LF line terminators, so they align with our encoding and line termination preference.\
No changes required.
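
A check like the one described above could be scripted rather than done by hand. The following is a minimal sketch, assuming it is enough to look for `\r\n` in the raw bytes and to attempt a UTF-8 decode; the `inspect_text_file` helper and the demo file names are hypothetical and not part of the original workflow.

```python
import os
import tempfile

def inspect_text_file(path):
    """Report whether a file decodes as UTF-8 and which line terminator it uses."""
    with open(path, "rb") as handle:
        raw = handle.read()
    try:
        raw.decode("utf-8")
        is_utf8 = True
    except UnicodeDecodeError:
        is_utf8 = False
    if b"\r\n" in raw:
        terminator = "CRLF"
    elif b"\n" in raw:
        terminator = "LF"
    else:
        terminator = "none"
    return is_utf8, terminator

# Demo on two throwaway files written with different line endings
with tempfile.TemporaryDirectory() as tmp:
    lf_path = os.path.join(tmp, "lf.txt")
    crlf_path = os.path.join(tmp, "crlf.txt")
    with open(lf_path, "wb") as handle:
        handle.write("català\nllengua\n".encode("utf-8"))
    with open(crlf_path, "wb") as handle:
        handle.write("català\r\nllengua\r\n".encode("utf-8"))
    lf_report = inspect_text_file(lf_path)
    crlf_report = inspect_text_file(crlf_path)

print(lf_report, crlf_report)
```

Running the helper over every file in a corpus folder would flag any file that needs conversion before parsing.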

In [36]:
#importing nltk's plain text corpus reader
from nltk.corpus.reader import PlaintextCorpusReader

corpus_root = '/Users/janabruses/Documents/data_science/Linguistic-Markers-Catalan-Substitution/data:/CTILC/tots'
corpus = PlaintextCorpusReader(corpus_root, r'.*\.txt')

#print(corpus.fileids())  # getting filenames to get a file
print(corpus.raw('001858_Discurs_llegit_en_lo_Certamen_Catalanist.out.txt')[:1000])  # taking a look at one of the files

<DOCUMENT>
<OBRA id="1858">
<AUTOR>Sagarra i de Siscar, Ferran de</AUTOR>
<TÍTOL>Discurs llegit en lo Certamen Catalanista de la Joventut Católica de Barcelona</TÍTOL>
<ANY>1891</ANY>
<CLASSIFICACIÓ_TEXTUAL llengua="NLIT" gènere="" tema="2" subtema="2.7" traducció="no" variant="central" />
</OBRA>
<TEXT>Discurs llegit en lo certamen catalanista de la joventut católica de Barcelona

Excm. é Ilm. Sr. Senyors: Quan en Bonaventura Carles Aribau, al començarse lo segon terç de la presént centuria, ab aquell Adéu á la patria, plé de sentiment y tendresa, y ab aquell recort per la llengua en que soná son primer vagit quan del mugró matern la dolça llet bebia, iniciava lo modern renaxement de nostra literatura, ¿qui ho havia de dir, que en breu espay de temps, poetas y prosadors conqueririan tants llors pera les lletres catalanes? Y ab tot, no sols fou axí, sino que ab la remembrança de antigues gestes, al fer reviure grans homens y fets del llibre d' or de nostra historia, se despertá en nosa

In [37]:
# there is some metadata mixed up in the txt at the start of the file
# let's check if there is metadata at the end too

In [38]:
print(corpus.raw('001858_Discurs_llegit_en_lo_Certamen_Catalanist.out.txt')[-500:])

jorns de goig y benaurança! Tornarás á ser lliure y poderosa, y per valls y serres, per ciutats y vilatges, per tot arreu, los fills d' aquesta terra, ab cantichs d' amor y agrahiment que, com nuvols d'encens s' enlayrarán fins al trono del Altíssim, dirán joyosos:

Dins nostres pits encara la veu dels avis sona, de /fe\ y de /patria\ s' alça la flama en nostres llars, ... La gent de Catalunya jamay negá sa mare: fills som tots de la Verge que regna en Montserrat!  
He dit.

</TEXT>
</DOCUMENT>



In [39]:
# the metadata mixed into the files might actually be very useful
# so we will change the approach and parse the CTILC data as an xml and store it as a pandas dataframe
# where the id, author, title, date and other metadata will be stored in columns
# the text will have its own column

In [40]:
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re

CTILC_data = []

for file in corpus.fileids():   
    soup = BeautifulSoup(corpus.raw(file), 'html.parser')
    data = {}  # dict for each file
    
    #ask how to do this with <OBRA id="1858">
    #data["ID"] = soup.find("OBRA").text if soup.find("OBRA") else np.nan
    data["Author"] = soup.find("autor").text if soup.find("autor") else np.nan
    data["Title"] = soup.find("títol").text if soup.find("títol") else np.nan
    data["Year"] = soup.find("any").text if soup.find("any") else np.nan
    data["Text"] = soup.find("text").text if soup.find("text") else np.nan
    #in the text the title begins the content, we will get rid of it as we have "\n\n" in between the title and the content
    data["Text"] = data["Text"].replace(r".*\n\n", "")

    CTILC_data.append(data)

# creating Pandas df
CTILC_df = pd.DataFrame(CTILC_data)
CTILC_df

Unnamed: 0,Author,Title,Year,Text
0,"Serra i Pagès, Rossend",Discurs llegit per... donar a conèxer la perso...,1926,Discurs llegit al objecte de donar a conèxer l...
1,"Millet i Pagès, Lluís",Parlament llegit en la festa inaugural de l'Or...,1920,Parlament llegit en la festa inaugural de l'or...
2,"Miró i Borràs, Oleguer",Discurs-pròlec,1900,Aforística médica popular catalana\n\nDiscurs-...
3,"Balari i Jovany, Josep",Discurs,1894,Discurs llegit en la festa dels jochs florals ...
4,"Torras i Ferreri, Cèsar August",Discurs,1903,Discurs llegit en la Sessió Pública Inaugural ...
5,"Torras i Bages, Josep",Parlament fet á la festa dels Jochs Florals de...,1899,Parlament fet á la festa dels Jochs Florals de...
6,"Collell, Jaume",Discurs pronunciat en la solemne festa dels Jo...,1899,Discurs pronunciat en la solemne festa dels Jo...
7,"Sagarra i de Siscar, Ferran de",Discurs llegit en lo Certamen Catalanista de l...,1891,Discurs llegit en lo certamen catalanista de l...
8,"Serra i Pagès, Rossend",Memoria,1905,Memoria del secretari del Consistori\n\nSenyor...
9,"Costa i Llobera, Miquel",Discurs,1906,Discurs del president del Consistori\n\nSenyor...
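
Two details of the cell above deserve a closer look. The `id` sits in an attribute of `<OBRA>` rather than in its text, and `str.replace` treats its first argument as a literal string, not a regular expression, which is consistent with the `Text` column above still beginning with the title. A minimal sketch of both steps, using only the standard `re` module on a shortened copy of the file shown earlier (the `parse_ctilc` helper is hypothetical; with BeautifulSoup the attribute should also be reachable through `soup.find("obra").get("id")`):

```python
import re

# shortened copy of the CTILC file printed earlier; the body is cut off with "..."
sample = (
    '<DOCUMENT>\n'
    '<OBRA id="1858">\n'
    '<AUTOR>Sagarra i de Siscar, Ferran de</AUTOR>\n'
    '<ANY>1891</ANY>\n'
    '</OBRA>\n'
    '<TEXT>Discurs llegit en lo certamen catalanista\n'
    '\n'
    'Excm. é Ilm. Sr. Senyors: ...\n'
    '</TEXT>\n'
    '</DOCUMENT>\n'
)

def parse_ctilc(raw):
    """Pull the work id and the body text (without its leading title) out of one file."""
    id_match = re.search(r'<OBRA\s+id="([^"]*)"', raw)
    text_match = re.search(r"<TEXT>(.*?)</TEXT>", raw, flags=re.DOTALL)
    text = text_match.group(1) if text_match else None
    if text is not None:
        # keep only what follows the first blank line, i.e. drop the repeated title
        parts = text.split("\n\n", 1)
        text = parts[1] if len(parts) == 2 else text
        text = text.strip()
    return {"ID": id_match.group(1) if id_match else None, "Text": text}

record = parse_ctilc(sample)
print(record)
```

Splitting at the first blank line removes whatever heading opens the text. Row 2 of the table above opens with a heading that differs from its `Title`, so the result may need a manual check for some files.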


### 2.2 Radioteca.cat 
https://radioteca.cat/about-us

### 2.3 Parlament Parla

The files are .tsv, so we will parse them using pandas.\
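Their encoding is UTF-8, but they came with CRLF line terminators, which were converted to LF line terminators using dos2unix in the terminal.\
To do so in your bash environment:
```
# run from the folder that holds the Parlament Parla .tsv files; converts them in place
dos2unix *.tsv
```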

In [41]:
# Loading data and creating a pandas dataframe:

In [42]:
# importing os to build file paths (pandas is already imported as pd)
import os
corpus_root = '/Users/janabruses/Documents/data_science/Linguistic-Markers-Catalan-Substitution/data:/ParlamentParla'

# file name (without .tsv) -> label stored in the "Partition" column
partitions = {
    'clean_train': 'clean_tr',
    'clean_dev': 'clean_dev',
    'clean_test': 'clean_ts',
    #'other_train': 'other_tr',  # not loaded yet
    'other_dev': 'other_dev',
    'other_test': 'other_ts',
}

# read each partition once and tag its rows, instead of copy-pasting the same call
frames = []
for name, label in partitions.items():
    frame = pd.read_csv(os.path.join(corpus_root, name + '.tsv'), sep='\t', header=0)
    frame["Partition"] = label
    frames.append(frame)

parlaments_parla_df = pd.concat(frames)  # other_train is still missing
parlaments_parla_df.columns = ['Speaker_id', 'Path', 'Sentence', 'Gender', 'Duration', 'Partition']
parlaments_parla_df

Unnamed: 0,Speaker_id,Path,Sentence,Gender,Duration,Partition
0,164,clean_train/3/1/31ca4d158eaef166c37a_18.87_23....,perquè que el president de catalunya sigui reb...,M,4.71,clean_tr
1,164,clean_train/3/1/31ca4d158eaef166c37a_60.13_65....,que lliga absolutament amb allò que vostè diu ...,M,5.50,clean_tr
2,336,clean_train/2/8/2803008bb00cb0c86de6_17.0_30.1...,gràcies presidenta consellera atès l'inici del...,M,13.15,clean_tr
3,336,clean_train/2/8/2803008bb00cb0c86de6_31.03_44....,li volem preguntar si el seu departament té pr...,M,13.02,clean_tr
4,336,clean_train/2/8/2803008bb00cb0c86de6_44.74_53....,per tal d'iniciar la recuperació de l'ensenyam...,M,8.49,clean_tr
...,...,...,...,...,...,...
1894,212,other_test/a/d/adee8af18ae122800ec2_289.34_298...,en primer lloc agrupar tota la normativa dispe...,F,8.92,other_ts
1895,212,other_test/a/d/adee8af18ae122800ec2_299.3_308....,d'acord però preferiríem i hi insistim no agru...,F,8.82,other_ts
1896,212,other_test/a/d/adee8af18ae122800ec2_315.12_327...,la segona finalitat és tenir una llei pròpia c...,F,12.85,other_ts
1897,212,other_test/a/d/adee8af18ae122800ec2_423.94_435...,que contribueixin a la transparència i a la pa...,F,11.96,other_ts


In [43]:
# EDA - Exploratory Data Analysis

In [44]:
parlaments_parla_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 87424 entries, 0 to 1898
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Speaker_id  87424 non-null  int64  
 1   Path        87424 non-null  object 
 2   Sentence    87424 non-null  object 
 3   Gender      87424 non-null  object 
 4   Duration    87424 non-null  float64
 5   Partition   87424 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 4.7+ MB


In [45]:
parlaments_parla_df.shape

(87424, 6)

In [46]:
parlaments_parla_df.describe()

Unnamed: 0,Speaker_id,Duration
count,87424.0,87424.0
mean,184.029866,9.135149
std,104.624795,3.264426
min,0.0,4.0
25%,101.0,6.31
50%,169.0,8.8
75%,272.0,11.92
max,389.0,38.33


In [47]:
parlaments_parla_df.values

array([[164, 'clean_train/3/1/31ca4d158eaef166c37a_18.87_23.58.wav',
        'perquè que el president de catalunya sigui rebut pel president de la comissió europea',
        'M', 4.71, 'clean_tr'],
       [164, 'clean_train/3/1/31ca4d158eaef166c37a_60.13_65.63.wav',
        "que lliga absolutament amb allò que vostè diu de l'estat del benestar que és el corredor ferroviari del mediterrani",
        'M', 5.5, 'clean_tr'],
       [336, 'clean_train/2/8/2803008bb00cb0c86de6_17.0_30.15.wav',
        "gràcies presidenta consellera atès l'inici del debat sobre els pressupostos de la generalitat del dos mil disset i el període de planificació de l'inici del proper curs escolar",
        'M', 13.15, 'clean_tr'],
       ...,
       [212, 'other_test/a/d/adee8af18ae122800ec2_315.12_327.97.wav',
        'la segona finalitat és tenir una llei pròpia catalana amb relació a aquesta matèria sí que hi coincidim una llei pròpia en aquesta matèria i deixin-me afegir que per a nosaltres junts pel sí llei

In [48]:
parlaments_parla_df.columns

Index(['Speaker_id', 'Path', 'Sentence', 'Gender', 'Duration', 'Partition'], dtype='object')

In [49]:
parlaments_parla_df["Partition"].value_counts()

Partition
clean_tr     79269
clean_dev     2155
clean_ts      2144
other_dev     1957
other_ts      1899
Name: count, dtype: int64

In [50]:
parlaments_parla_df["Gender"].value_counts()

Gender
M    53220
F    34204
Name: count, dtype: int64

In [51]:
parlaments_parla_df["Duration"].mean() # units appear to be seconds: Duration equals end minus start in the Path file names (e.g. 23.58 - 18.87 = 4.71)

9.135148929355784
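
The file names in `Path` appear to carry each segment's start and end offsets, which gives a way to check `Duration` against them. A minimal sketch, assuming every file name ends in `_<start>_<end>.wav` as in the first row printed by `.values` above (the `segment_bounds` helper is hypothetical):

```python
import re

def segment_bounds(path):
    """Extract the (start, end) offsets embedded at the end of a segment file name."""
    match = re.search(r"_(\d+(?:\.\d+)?)_(\d+(?:\.\d+)?)\.wav$", path)
    if match is None:
        raise ValueError(f"no offsets found in {path!r}")
    return float(match.group(1)), float(match.group(2))

# first row of the dataframe; its Duration value is 4.71
path = "clean_train/3/1/31ca4d158eaef166c37a_18.87_23.58.wav"
start, end = segment_bounds(path)
span = round(end - start, 2)
print(span)
```

If the computed span agrees with `Duration` across rows, the offsets and durations are most plausibly in seconds; the corpus documentation would be the place to confirm this.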

In [52]:
parlaments_parla_df["Text_len"] = parlaments_parla_df["Sentence"].apply(len)
parlaments_parla_df

Unnamed: 0,Speaker_id,Path,Sentence,Gender,Duration,Partition,Text_len
0,164,clean_train/3/1/31ca4d158eaef166c37a_18.87_23....,perquè que el president de catalunya sigui reb...,M,4.71,clean_tr,85
1,164,clean_train/3/1/31ca4d158eaef166c37a_60.13_65....,que lliga absolutament amb allò que vostè diu ...,M,5.50,clean_tr,115
2,336,clean_train/2/8/2803008bb00cb0c86de6_17.0_30.1...,gràcies presidenta consellera atès l'inici del...,M,13.15,clean_tr,176
3,336,clean_train/2/8/2803008bb00cb0c86de6_31.03_44....,li volem preguntar si el seu departament té pr...,M,13.02,clean_tr,209
4,336,clean_train/2/8/2803008bb00cb0c86de6_44.74_53....,per tal d'iniciar la recuperació de l'ensenyam...,M,8.49,clean_tr,160
...,...,...,...,...,...,...,...
1894,212,other_test/a/d/adee8af18ae122800ec2_289.34_298...,en primer lloc agrupar tota la normativa dispe...,F,8.92,other_ts,125
1895,212,other_test/a/d/adee8af18ae122800ec2_299.3_308....,d'acord però preferiríem i hi insistim no agru...,F,8.82,other_ts,101
1896,212,other_test/a/d/adee8af18ae122800ec2_315.12_327...,la segona finalitat és tenir una llei pròpia c...,F,12.85,other_ts,210
1897,212,other_test/a/d/adee8af18ae122800ec2_423.94_435...,que contribueixin a la transparència i a la pa...,F,11.96,other_ts,201


In [53]:
#parlaments_parla_df.set_index("Speaker_id")

In [54]:
parlaments_parla_df["Text_len"].cumsum()

0             85
1            200
2            376
3            585
4            745
          ...   
1894    13174008
1895    13174109
1896    13174319
1897    13174520
1898    13174757
Name: Text_len, Length: 87424, dtype: int64

In [55]:
#downloaded catalan tokenizer: pip install stanza #look for data about this
#import stanza
#tokenizer = stanza.Pipeline('ca', processors='tokenize')
#doc = tokenizer(str(parlaments_parla_df.loc[100]["Sentence"]))
#tokens = [word.text for sentence in doc.sentences for word in sentence.words]
#print(tokens)
#parlaments_parla_df.loc[164]["Sentence"]
#doc
#tokens

In [56]:
#from tqdm import tqdm #to keep track because it takes crazy long
#tqdm.pandas()  # This adds the progress_apply method to Pandas

# Initialize the tokenizer before using it
#tokenizer = stanza.Pipeline('ca', processors='tokenize')

# Apply tokenization with a progress bar
#parlaments_parla_df["Doc"] = parlaments_parla_df["Sentence"].progress_apply(
    #lambda sent: tokenizer(sent)  # one call per sentence; looping over the string itself would tokenize it character by character
#)

### 2.4 ParlaMint-ES-CT

Files are in multiple formats. To be consistent with the other datasets, which are less flexible, we'll use the tsv documents for the metadata and the txt documents for the content/transcription.

In [57]:
#ParlaMint part 1 - Metadata
import glob

# Define the path where ParlaMint data is stored
data_path = "/Users/janabruses/Documents/data_science/Linguistic-Markers-Catalan-Substitution/data:/ParlaMint-ES-CT.ana/ParlaMint-ES-CT.txt"  # Adjust to your directory

# Load all metadata TSV files
metadata_files = glob.glob(os.path.join(data_path, "**", "*-meta.tsv"))

# Read and concatenate all metadata files
metadata = []
for file in metadata_files:
    meta_df = pd.read_csv(file, sep="\t", index_col = False)
    metadata.append(meta_df)

metadata_df = pd.concat(metadata)
#metadata_df.drop("ID")

# Inspect metadata
#print(metadata_df.shape)
metadata_df

Unnamed: 0,Text_ID,ID,Title,Date,Body,Term,Session,Meeting,Sitting,Agenda,...,Speaker_MP,Speaker_minister,Speaker_party,Speaker_party_name,Party_status,Party_orientation,Speaker_ID,Speaker_name,Speaker_gender,Speaker_birth
0,ParlaMint-ES-CT_2022-07-07-3502,ParlaMint-ES-CT_2022-07-07-3502.1.0,"Corpus Parlamentari en català ParlaMint-ES-CT,...",2022-07-07,Unicameralisme,XIV Legislatura,-,35,2,-,...,notMP,notMinister,GP-JxCAT,Grup Parlamentari de Junts per Catalunya,Coalition,Entre centredreta i dreta,BorràsLaura,"Borràs i Castanyer, Laura",F,1970
1,ParlaMint-ES-CT_2022-07-07-3502,ParlaMint-ES-CT_2022-07-07-3502.224.0,"Corpus Parlamentari en català ParlaMint-ES-CT,...",2022-07-07,Unicameralisme,XIV Legislatura,-,35,2,-,...,MP,notMinister,GP-ERC,Grup Parlamentari Esquerra Republicana de Cata...,Coalition,Entre centreesquerra i esquerra,VilaltaMarta,"Vilalta i Torres, Marta",F,1984
2,ParlaMint-ES-CT_2022-07-07-3502,ParlaMint-ES-CT_2022-07-07-3502.2.0,"Corpus Parlamentari en català ParlaMint-ES-CT,...",2022-07-07,Unicameralisme,XIV Legislatura,-,35,2,-,...,notMP,notMinister,GP-JxCAT,Grup Parlamentari de Junts per Catalunya,Coalition,Entre centredreta i dreta,BorràsLaura,"Borràs i Castanyer, Laura",F,1970
3,ParlaMint-ES-CT_2022-07-07-3502,ParlaMint-ES-CT_2022-07-07-3502.3.0,"Corpus Parlamentari en català ParlaMint-ES-CT,...",2022-07-07,Unicameralisme,XIV Legislatura,-,35,2,-,...,notMP,Minister,ERC,Esquerra Republicana de Catalunya,-,Entre centreesquerra i esquerra,VilagràLaura,"Vilagrà Pons, Laura",F,1976
4,ParlaMint-ES-CT_2022-07-07-3502,ParlaMint-ES-CT_2022-07-07-3502.4.0,"Corpus Parlamentari en català ParlaMint-ES-CT,...",2022-07-07,Unicameralisme,XIV Legislatura,-,35,2,-,...,notMP,notMinister,GP-JxCAT,Grup Parlamentari de Junts per Catalunya,Coalition,Entre centredreta i dreta,BorràsLaura,"Borràs i Castanyer, Laura",F,1970
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
279,ParlaMint-ES-CT_2016-06-02-1702,ParlaMint-ES-CT_2016-06-02-1702.280.0,"Corpus Parlamentari en català ParlaMint-ES-CT,...",2016-06-02,Unicameralisme,XI Legislatura,-,17,2,-,...,MP,notMinister,GP-JxSi;GP-REP,Grup Parlamentari de Junts pel Sí;Grup Parlame...,-,Partit arreplegador,ForcadellCarme,"Forcadell i Lluís, Carme",F,1955
280,ParlaMint-ES-CT_2016-06-02-1702,ParlaMint-ES-CT_2016-06-02-1702.281.0,"Corpus Parlamentari en català ParlaMint-ES-CT,...",2016-06-02,Unicameralisme,XI Legislatura,-,17,2,-,...,MP,notMinister,GP-PPC,Grup Parlamentari del Partit Popular de Catalunya,Opposition,Entre centredreta i dreta,VillagrasaAlberto,"Villagrasa Gil, Alberto",M,1971
281,ParlaMint-ES-CT_2016-06-02-1702,ParlaMint-ES-CT_2016-06-02-1702.282.0,"Corpus Parlamentari en català ParlaMint-ES-CT,...",2016-06-02,Unicameralisme,XI Legislatura,-,17,2,-,...,MP,notMinister,GP-JxSi;GP-REP,Grup Parlamentari de Junts pel Sí;Grup Parlame...,-,Partit arreplegador,ForcadellCarme,"Forcadell i Lluís, Carme",F,1955
282,ParlaMint-ES-CT_2016-06-02-1702,ParlaMint-ES-CT_2016-06-02-1702.283.0,"Corpus Parlamentari en català ParlaMint-ES-CT,...",2016-06-02,Unicameralisme,XI Legislatura,-,17,2,-,...,MP,notMinister,GP-PPC,Grup Parlamentari del Partit Popular de Catalunya,Opposition,Entre centredreta i dreta,RodríguezSanti,"Rodríguez i Serra, Santi",M,1964
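
One detail of `glob` matters for the pattern used in this cell: `**` only matches nested directories when `recursive=True` is passed. Without it, `**` behaves like a single `*` and matches exactly one directory level, which is enough only if every metadata file sits one folder below `data_path`. A minimal sketch on a throwaway directory tree (the folder and file names are invented for the demo):

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # one file directly under root, one a level down, one two levels down
    layout = [
        "a-meta.tsv",
        os.path.join("2015", "b-meta.tsv"),
        os.path.join("2015", "10", "c-meta.tsv"),
    ]
    for rel in layout:
        full = os.path.join(root, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "w", encoding="utf-8") as handle:
            handle.write("ID\tTitle\n")

    pattern = os.path.join(root, "**", "*-meta.tsv")
    # without recursive=True, "**" matches exactly one directory level
    shallow = sorted(os.path.basename(p) for p in glob.glob(pattern))
    # with recursive=True, "**" matches zero or more directory levels
    deep = sorted(os.path.basename(p) for p in glob.glob(pattern, recursive=True))

print(shallow)
print(deep)
```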


In [58]:
#ParlaMint part 2 - txt

text_files = glob.glob(os.path.join(data_path, "**", "*.txt"), recursive=True)

text_list = []
for file in text_files:
    if "README" not in file: # avoiding the README.txt getting mixed up with our data
        text_df = pd.read_csv(file, sep="\t", header = None)
        text_list.append(text_df)

text_df = pd.concat(text_list)

text_df = text_df.dropna()

text_df = text_df.rename(columns ={0: "ID", 1:"Text"})

text_df

Unnamed: 0,ID,Text
0,ParlaMint-ES-CT_2022-02-22-2501.1.0,"Molt bona tarda, ens disposem a començar una n..."
1,ParlaMint-ES-CT_2022-02-22-2501.2.0,"Buenas tardes, señora presidenta. Señores dipu..."
2,ParlaMint-ES-CT_2022-02-22-2501.3.0,"Per respondre, té la paraula la consellera d'A..."
3,ParlaMint-ES-CT_2022-02-22-2501.4.0,"Gràcies, presidenta. Molt bona tarda a totes i..."
4,ParlaMint-ES-CT_2022-02-22-2501.5.0,"En el torn de rèplica, té la paraula el diputa..."
...,...,...
126,ParlaMint-ES-CT_2016-06-09-1802.127.0,"Moltes gràcies, senyor Fernández. A continuaci..."
127,ParlaMint-ES-CT_2016-06-09-1802.128.0,"Gràcies, presidenta. Bon dia. Des de la CUP, e..."
128,ParlaMint-ES-CT_2016-06-09-1802.129.0,"Moltes gràcies, senyora Vehí. A continuació, t..."
129,ParlaMint-ES-CT_2016-06-09-1802.130.0,"Gràcies, presidenta. Vicepresident, consellera..."


In [59]:
# great! same number of rows!!!

In [60]:
parlaMint_df = metadata_df.merge(text_df, on="ID", how="outer")
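
Matching row counts are a good sign, but an outer merge keeps unmatched rows from either side without complaint, so it may be worth confirming that the two `ID` columns hold the same values. A library-free sketch of that check on toy ID lists (the IDs and the `compare_ids` helper are made up for illustration; in pandas, `merge` also accepts `indicator=True` and `validate="one_to_one"` for a similar purpose):

```python
from collections import Counter

def compare_ids(meta_ids, text_ids):
    """Summarise how two ID lists differ before merging on them."""
    meta_counts, text_counts = Counter(meta_ids), Counter(text_ids)
    return {
        "only_in_meta": sorted(set(meta_counts) - set(text_counts)),
        "only_in_text": sorted(set(text_counts) - set(meta_counts)),
        # IDs that appear more than once within either list
        "duplicated": sorted(
            {i for counts in (meta_counts, text_counts) for i, n in counts.items() if n > 1}
        ),
    }

# both toy lists have three rows, yet one ID on each side has no partner
meta_ids = ["doc-1.1.0", "doc-1.2.0", "doc-1.3.0"]
text_ids = ["doc-1.1.0", "doc-1.2.0", "doc-1.4.0"]
report = compare_ids(meta_ids, text_ids)
print(report)
```

With the real data, the inputs would be `metadata_df["ID"]` and `text_df["ID"]`; three empty lists in the report would mean that every utterance has exactly one metadata row.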

In [61]:
parlaMint_df

Unnamed: 0,Text_ID,ID,Title,Date,Body,Term,Session,Meeting,Sitting,Agenda,...,Speaker_minister,Speaker_party,Speaker_party_name,Party_status,Party_orientation,Speaker_ID,Speaker_name,Speaker_gender,Speaker_birth,Text
0,ParlaMint-ES-CT_2015-10-26-0101,ParlaMint-ES-CT_2015-10-26-0101.1.0,"Corpus Parlamentari en català ParlaMint-ES-CT,...",2015-10-26,Unicameralisme,XI Legislatura,-,1,1,-,...,notMinister,Independent,Independent,-,-,BayonaAntoni,"Bayona Rocamora, Antoni",M,1954,"Autoritats, senyores i senyors, bon dia i benv..."
1,ParlaMint-ES-CT_2015-10-26-0101,ParlaMint-ES-CT_2015-10-26-0101.2.0,"Corpus Parlamentari en català ParlaMint-ES-CT,...",2015-10-26,Unicameralisme,XI Legislatura,-,1,1,-,...,notMinister,GP-CUP,Grup Parlamentari de la Candidatura d'Unitat P...,Opposition,Entre esquerra i extrema esquerra,DeJòdarJulià,"de Jòdar i Muñoz, Julià",M,1942,Un cop iniciada la sessió constitutiva amb la ...
2,ParlaMint-ES-CT_2015-10-26-0101,ParlaMint-ES-CT_2015-10-26-0101.3.0,"Corpus Parlamentari en català ParlaMint-ES-CT,...",2015-10-26,Unicameralisme,XI Legislatura,-,1,1,-,...,notMinister,GP-CSP,Grup Parlamentari Catalunya sí que es Pot,Opposition,Esquerra,GinerJoan,"Giner Miguelez, Joan",M,1989,"«La Junta Electoral Provincial, en sessió ting..."
3,ParlaMint-ES-CT_2015-10-26-0101,ParlaMint-ES-CT_2015-10-26-0101.4.0,"Corpus Parlamentari en català ParlaMint-ES-CT,...",2015-10-26,Unicameralisme,XI Legislatura,-,1,1,-,...,notMinister,GP-CUP,Grup Parlamentari de la Candidatura d'Unitat P...,Opposition,Entre esquerra i extrema esquerra,DeJòdarJulià,"de Jòdar i Muñoz, Julià",M,1942,Prego ara al secretari de la Mesa d'Edat senyo...
4,ParlaMint-ES-CT_2015-10-26-0101,ParlaMint-ES-CT_2015-10-26-0101.5.0,"Corpus Parlamentari en català ParlaMint-ES-CT,...",2015-10-26,Unicameralisme,XI Legislatura,-,1,1,-,...,notMinister,GP-CSP,Grup Parlamentari Catalunya sí que es Pot,Opposition,Esquerra,GinerJoan,"Giner Miguelez, Joan",M,1989,"«La Junta Electoral Provincial, en sessió ting..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50819,ParlaMint-ES-CT_2022-07-21-3602,ParlaMint-ES-CT_2022-07-21-3602.95.0,"Corpus Parlamentari en català ParlaMint-ES-CT,...",2022-07-21,Unicameralisme,XIV Legislatura,-,36,2,-,...,notMinister,GP-JxCAT,Grup Parlamentari de Junts per Catalunya,Coalition,Entre centredreta i dreta,BorràsLaura,"Borràs i Castanyer, Laura",F,1970,"I, finalment, en nom del Grup Mixt, té la para..."
50820,ParlaMint-ES-CT_2022-07-21-3602,ParlaMint-ES-CT_2022-07-21-3602.96.0,"Corpus Parlamentari en català ParlaMint-ES-CT,...",2022-07-21,Unicameralisme,XIV Legislatura,-,36,2,-,...,notMinister,GP-GM,Grup Mixt,-,Entre centredreta i dreta,FernándezAlejandro,"Fernández Álvarez, Alejandro",M,1976,"Gràcies, presidenta. Quan es parla del litoral..."
50821,ParlaMint-ES-CT_2022-07-21-3602,ParlaMint-ES-CT_2022-07-21-3602.97.0,"Corpus Parlamentari en català ParlaMint-ES-CT,...",2022-07-21,Unicameralisme,XIV Legislatura,-,36,2,-,...,notMinister,GP-JxCAT,Grup Parlamentari de Junts per Catalunya,Coalition,Entre centredreta i dreta,BorràsLaura,"Borràs i Castanyer, Laura",F,1970,"Moltes gràcies. Finalment, per pronunciar-se s..."
50822,ParlaMint-ES-CT_2022-07-21-3602,ParlaMint-ES-CT_2022-07-21-3602.98.0,"Corpus Parlamentari en català ParlaMint-ES-CT,...",2022-07-21,Unicameralisme,XIV Legislatura,-,36,2,-,...,notMinister,GP-CUP,Grup Parlamentari de la Candidatura d'Unitat P...,Opposition,Entre esquerra i extrema esquerra,CornellàDani,"Cornellà Detrell, Dani",M,1978,"Bé, gràcies. Primer de tot, faré una esmena in..."
