### 2.5 Complete dataframe with FULL RADIOTECA

Some of the dataframes come with a large amount of metadata that is not necessary for our research purposes. To preserve that metadata, we keep the separate dataframes and build a merged one containing only the columns needed for our research question: title, date and text/content. We will also create other columns that will be useful for our Exploratory Data Analysis.

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re

In [4]:
# loading the individual dataframes from their pickled form
CTILC_df = pd.read_pickle("Data/CTILC.pkl")
parlament_parla_df = pd.read_pickle("Data/parlaparla.pkl")
parlaMint_df = pd.read_pickle("Data/parlaMint.pkl")

**Comment:** We will treat Radioteca differently, loading the data in chunks instead of all at once, because loading it whole makes the notebook crash. Moreover, as this notebook and process were redone after scraping the entirety of Radioteca, we already know the full dataframe gives a very imbalanced distribution of the data. It is a massive dataframe, but most of its data comes from the same years, which is not what we are looking for. Therefore, we will apply two strategies to reduce the amount of data and create a sample of the whole Radioteca dataframe while trying to avoid introducing bias.

The strategies, as in the original notebook, will be:\
**Strategy 1: Limiting Individual Contribution**\
Aiming for the most unbiased and representative data, we will first cap each speaker's contribution in the Radioteca data: once a speaker's cumulative text within an episode goes over 1500 characters (about 200 words), their further contributions are dropped. This will allow for a less speaker-specific analysis.\
**Strategy 2: Balancing Program Contribution**\
Since Radioteca has a lot of metadata, we will also use another field to ensure a more representative distribution while reducing the data size: each show episode's contribution is limited to 1500 characters as well. That way, we also ensure a corpus that is more diverse topic-wise and less show-specific.

We will do this one year file at a time, without ever loading the full dataset at once.
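To make the capping rule concrete, here is a minimal sketch on made-up data. The column names (`Episode`, `Speaker`, `Text_len`) mirror the ones used in this notebook, but the values are invented for illustration only.

```python
import pandas as pd

# Toy data: column names mirror the notebook, values are made up for illustration
toy = pd.DataFrame({
    "Episode": ["ep1", "ep1", "ep1", "ep1", "ep2"],
    "Speaker": ["A", "A", "A", "B", "A"],
    "Text_len": [900, 500, 400, 2000, 300],
})

# Running total of characters per speaker within each episode
toy["CUMSUM_len"] = toy.groupby(["Episode", "Speaker"])["Text_len"].cumsum()

# Keeping only the rows whose running total is still within the 1500-character cap
capped = toy[toy["CUMSUM_len"] <= 1500]
print(capped)
```

Note that a speaker whose very first turn is already longer than the cap (speaker B here) ends up contributing nothing at all, since the running total is above 1500 from the first row.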

In [9]:
# function for strategy 1, speaker limit (we will reuse it for the other larger dataframes)
def under1500(dataframe, contributorColName):
    '''
    takes in a dataframe and the name of the column identifying the contributor
    after the cumulative text length of one contributor goes over 1500 characters
    the following contributions are no longer added to the dataframe
    ensuring a max of 1500 characters per speaker/contributor
    (speakers are counted within each episode, other contributors over the whole dataframe)
    prints out the maximum and minimum length per contributor before and after the change
    returns the filtered dataframe
    '''
    # speakers are grouped within each episode; any other contributor column is grouped on its own
    group_cols = ["Episode", contributorColName] if contributorColName == "Speaker" else contributorColName

    # running total of characters per contributor, kept only while it stays within the cap
    dataframe["CUMSUM_len"] = dataframe.groupby(group_cols)["Text_len"].cumsum()
    max1500_df = dataframe[dataframe["CUMSUM_len"] <= 1500]

    before_spkcount = dataframe.groupby(group_cols)["Text_len"].sum()
    after_spkcount = max1500_df.groupby(group_cols)["Text_len"].sum()

    # printing contribution max and min to ensure the process worked out well
    print()
    print("Before speaker contribution limit:")
    print("The character count for the speaker contributing the most data is", before_spkcount.max())
    print("The character count for the speaker contributing the least data is", before_spkcount.min())
    print("After speaker contribution limit:")
    print("The character count for the speaker contributing the most data is", after_spkcount.max())
    print("The character count for the speaker contributing the least data is", after_spkcount.min())

    return max1500_df

In [21]:
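import glob
import re

for f in sorted(glob.glob("Data/radioteca_year_*.parquet")):
    # recovering the year from the file name (assumes a 4-digit year right after "radioteca_year_")
    match = re.search(r"radioteca_year_(\d{4})\.parquet$", f)
    if match is None:
        continue  # skipping files that do not follow the per-year naming
    year = match.group(1)

    radioteca_Ydf = pd.read_parquet(f)

    # Strategy 1: limit speaker contributions to 1500 characters
    radioteca_Ydf = under1500(radioteca_Ydf, "Speaker").copy()

    # Strategy 2: limit episode contributions to 1500 characters
    radioteca_Ydf["CUMSUM_ep"] = radioteca_Ydf.groupby("Episode")["Text_len"].cumsum()
    radioteca_Ydf = radioteca_Ydf[radioteca_Ydf["CUMSUM_ep"] <= 1500]

    radioteca_Ydf.to_parquet(f"radioteca_year_{year}_reduced.parquet")
    print(f"radioteca_year_{year}_reduced.parquet")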

In [16]:
df = pd.read_parquet("data/full_radioteca_cleaned.parquet")

ArrowInvalid: Could not open Parquet input source '<Buffer>': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
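One possible way to obtain a single Radioteca dataframe is to concatenate the reduced per-year files written by the loop above instead of reading one large file. The sketch below assumes those files sit in the working directory and follow the `radioteca_year_*_reduced.parquet` naming; `combine_reduced_years` is a hypothetical helper, not part of the original notebook.

```python
import glob

import pandas as pd


def combine_reduced_years(pattern="radioteca_year_*_reduced.parquet", reader=pd.read_parquet):
    """Concatenate every reduced per-year file matching `pattern` into one dataframe.

    `pattern` and `reader` are illustrative parameters (assumptions), not notebook names.
    """
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    # ignore_index gives the combined frame a fresh 0..n-1 index
    return pd.concat([reader(path) for path in paths], ignore_index=True)


# Possible usage, once the reduced files exist:
# radioteca_df = combine_reduced_years()
```

Reading the small per-year files one by one keeps the peak memory use close to the size of the reduced data rather than the size of the full scrape.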

In [11]:
# creating a reduced dataframe with only the date info, title of the file, content/text columns and its length
reduced_CTILC = CTILC_df.filter(["Year", "Title", "Text", "Text_len"])
reduced_parlament_parla = parlament_parla_df.filter(["Path", "Sentence", "Text", "Text_len"])
reduced_parlamint = parlaMint_df.filter(["Date", "Title", "Text", "Text_len"])
reduced_radioteca = radioteca_df.filter(["Year", "Line_id", "Text", "Text_len"])

NameError: name 'radioteca_df' is not defined

In [4]:
# adding a Source_corpora column to keep track of which corpus each row comes from
reduced_CTILC["Source_corpora"] = "CTILC"
reduced_parlament_parla["Source_corpora"] = "Parlament Parla"
reduced_parlamint["Source_corpora"] = "ParlaMint"
reduced_radioteca["Source_corpora"] = "Radioteca.cat"

In [5]:
reduced_parlament_parla.head()

Unnamed: 0,Path,Text,Text_len,Source_corpora
0,clean_train/3/1/31ca4d158eaef166c37a_18.87_23....,perquè que el president de catalunya sigui reb...,85,Parlament Parla
1,clean_train/3/1/31ca4d158eaef166c37a_60.13_65....,que lliga absolutament amb allò que vostè diu ...,115,Parlament Parla
2,clean_train/2/8/2803008bb00cb0c86de6_17.0_30.1...,gràcies presidenta consellera atès l'inici del...,176,Parlament Parla
3,clean_train/2/8/2803008bb00cb0c86de6_31.03_44....,li volem preguntar si el seu departament té pr...,209,Parlament Parla
4,clean_train/2/8/2803008bb00cb0c86de6_44.74_53....,per tal d'iniciar la recuperació de l'ensenyam...,160,Parlament Parla


No date metadata is provided for the Parlament Parla dataframe.\
Since we need a date for our purposes, we will approximate one based on the information given by the corpus owners.\
As Parlament Parla is presented as a corpus containing data from 2007 to 2015, we will assign the approximate date 2010 to all the rows in the Parlament Parla dataframe.\
For the ParlaMint corpus and the Radioteca dataframe, where the full date (month, day, year) is provided, we will keep only the year.
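As a small illustration of these two date decisions, the sketch below uses made-up rows. It assumes the dates are ISO-style strings that start with the four-digit year (which is what taking the first four characters implies) and casts the result to integers right away, so the year column does not mix strings and numbers. The `toy_*` names are hypothetical.

```python
import pandas as pd

# Made-up rows; assumes date strings that start with the four-digit year
toy_parlamint = pd.DataFrame({"Date": ["2015-10-26", "2016-01-12", "2020-03-04"]})
toy_parlament_parla = pd.DataFrame({"Text": ["gràcies presidenta", "moltes gràcies"]})

# Keeping only the year, stored as an integer
toy_parlamint["Year"] = toy_parlamint["Date"].str[:4].astype(int)

# Assigning the same approximate year to every row that has no date
toy_parlament_parla["Year"] = 2010

print(toy_parlamint)
print(toy_parlament_parla)
```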

In [6]:
# adding 2010 date to parlament parla rows
reduced_parlament_parla["Year"] = 2010

# keeping only the year on parlaMint
reduced_parlamint["Date"] = reduced_parlamint["Date"].str[:4]  # first four characters hold the year

In [7]:
reduced_parlament_parla.head(2) # checking if we successfully created the Year column

Unnamed: 0,Path,Text,Text_len,Source_corpora,Year
0,clean_train/3/1/31ca4d158eaef166c37a_18.87_23....,perquè que el president de catalunya sigui reb...,85,Parlament Parla,2010
1,clean_train/3/1/31ca4d158eaef166c37a_60.13_65....,que lliga absolutament amb allò que vostè diu ...,115,Parlament Parla,2010


In [8]:
# casting year as integer
print(type(reduced_radioteca["Year"].iloc[0])) # it is currently a float that we extracted from the complete date
reduced_radioteca["Year"] = reduced_radioteca["Year"].astype(int)
print(type(reduced_radioteca["Year"].iloc[0])) # checking if data type is now what we need

<class 'numpy.float64'>
<class 'numpy.int64'>


In [9]:
# adjusting column names to match before concatenating
reduced_CTILC = reduced_CTILC.rename(columns={"Title": "Line_id"})
reduced_parlament_parla = reduced_parlament_parla.rename(columns={"Path": "Line_id"})
reduced_parlamint = reduced_parlamint.rename(columns={"Date": "Year", "Title": "Line_id"})

In [10]:
# concatenating all datasets' relevant columns in a single data frame
complete_data = pd.concat([reduced_CTILC, reduced_parlament_parla, reduced_parlamint, reduced_radioteca]) 
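The renaming and concatenation steps can be sketched on two one-row frames. The values below are made up, and passing `ignore_index=True` is an optional variation rather than what the cell above does: it gives the combined frame a fresh index instead of repeating each source's row labels.

```python
import pandas as pd

# One-row stand-ins for two of the reduced dataframes (made-up values)
toy_ctilc = pd.DataFrame({"Year": [1926], "Title": ["Discurs"], "Text": ["Senyors:"],
                          "Text_len": [8], "Source_corpora": ["CTILC"]})
toy_radioteca = pd.DataFrame({"Year": [2025], "Line_id": ["Veu A00:40:27"], "Text": ["Bon dia"],
                              "Text_len": [7], "Source_corpora": ["Radioteca.cat"]})

# Only the columns whose names differ need to be renamed
toy_ctilc = toy_ctilc.rename(columns={"Title": "Line_id"})

# Stacking the frames; matching column names line up, and the index restarts at 0
combined = pd.concat([toy_ctilc, toy_radioteca], ignore_index=True)
print(combined)
```

If a column name is left unmatched, `pd.concat` keeps both columns and fills the gaps with NaN, so checking `combined.isna().sum()` afterwards is a quick sanity test.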

In [11]:
complete_data.head()

Unnamed: 0,Year,Line_id,Text,Text_len,Source_corpora
0,1926,Discurs llegit per... donar a conèxer la perso...,"L'home que per amor al estudi, impulsat per un...",37497.0,CTILC
1,1920,Parlament llegit en la festa inaugural de l'Or...,"Cantaires de la Garriga, Senyores i senyors:\n...",9253.0,CTILC
2,1900,Discurs-pròlec,Discurs-prolec Llegit en la societat mèdic-far...,73881.0,CTILC
3,1894,Discurs,"Senyors excelentissims, senyors:\n\nQuan rebí ...",29393.0,CTILC
4,1903,Discurs,"Senyors:\n\nSembla que era air, y fa ja uns qu...",26577.0,CTILC


In [12]:
print(complete_data.isna().sum())

Year                     0
Line_id                  0
Text                     0
Text_len          29700188
Source_corpora           0
dtype: int64
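The count above shows that `Text_len` is missing for a large number of rows. One possible way to fill it, assuming `Text_len` is meant to be the number of characters in `Text`, is sketched below on made-up rows.

```python
import numpy as np
import pandas as pd

# Made-up rows where some lengths are missing
toy = pd.DataFrame({
    "Text": ["Bon dia", "Senyors:", "gràcies"],
    "Text_len": [7.0, np.nan, np.nan],
})

# Filling only the missing lengths with the character count of the text
toy["Text_len"] = toy["Text_len"].fillna(toy["Text"].str.len())

# With no missing values left, the column can be stored as integers
toy["Text_len"] = toy["Text_len"].astype(int)
print(toy)
```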


In [13]:
# sorting joined data by year, from oldest to most recent
complete_data["Year"] = complete_data["Year"].astype(int) # casting all Years as integers, as we have some strings mixed in
complete_data = complete_data.sort_values("Year") # sort_values returns a new dataframe, so we assign it back
complete_data

Unnamed: 0,Year,Line_id,Text,Text_len,Source_corpora
27,1860,Discurs,"Breu seré, cuant ja se han complagut vostres o...",9551.0,CTILC
23,1868,Discurs,Excel·lentissim senyor:\n\nA últims del segle ...,18618.0,CTILC
26,1873,Discurs pronunciat en la sessió inaugural que ...,Senyors:\n\nDever meu es avuy 'l dirigirvos la...,36899.0,CTILC
25,1876,Teatre catalá,Sempre es estada tal la seua manera de reprodu...,62307.0,CTILC
21,1878,Discurs inaugural,Discurs inaugural.\n\nExcel·lentísim senyor; S...,21840.0,CTILC
...,...,...,...,...,...
445207,2025,Veu A00:40:27,"Encara vinculat a les drogues, atenció a aques...",,Radioteca.cat
445206,2025,Veu C00:40:06,"Aquests nous opioides es diuen nitasens, són u...",,Radioteca.cat
445205,2025,Veu A00:38:49,I atenció a la notícia que s'ha publicat a les...,,Radioteca.cat
445213,2025,Veu A00:42:41,"I a Barcelona, la Guàrdia Urbana investiga tam...",,Radioteca.cat


In [14]:
# resetting index after sorting
complete_data = complete_data.reset_index(drop=True)  # drop=True discards the old, repeated row labels

In [None]:
# pickling the complete joined data frame
complete_data.to_pickle("myfulldata.pkl")

In [None]:
#creating a dictionary of the cumulative amount of characters per year
year_length = {}
# setting year as the index of the dataframe
complete_data = complete_data.set_index("Year")
# iterating over the rows 
for year, row in complete_data.iterrows():
    if year not in year_length:
        year_length[year] = row["Text_len"]  # initializing year's length total
    else:
        year_length[year] += row["Text_len"]  # accumulating length to already present year
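Iterating row by row can be slow on a dataframe of this size. A vectorized alternative is sketched below on made-up data; in the notebook, `Year` is the index at this point, and `groupby("Year")` accepts an index level name as well as a column name. One difference to keep in mind: `sum()` skips missing values by default, whereas adding a NaN inside the loop turns that year's total into NaN.

```python
import pandas as pd

# Made-up rows, including one missing length
toy = pd.DataFrame({
    "Year": [1900, 1900, 2010, 2025],
    "Text_len": [100.0, 50.0, 85.0, None],
})

# Total number of characters per year, as a plain dictionary
year_length = toy.groupby("Year")["Text_len"].sum().to_dict()
print(year_length)
```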

In [None]:
# sorting the years by their total character count, from smallest to largest
year_length = dict(sorted(year_length.items(), key=lambda item: item[1]))
# uncomment to inspect the totals
# for key, value in year_length.items():
#     print(key, "-", value)

In [None]:
import matplotlib.pyplot as plt

# placeholder y-values for timeline
years = list(year_length.keys())  
y_values = np.ones(len(years)) 

# creating mosaic layout for our multiple plots
fig, ax = plt.subplot_mosaic([["B", "B"],  # Timeline
                              ["A", "A"],  # Histogram
                              ["D", "D"]],  # Text count & corpus length
                             figsize=(10, 8),
                             constrained_layout=True)

# (A) Histogram of Document Distribution Over Time
ax["A"].hist(complete_data.index, bins=min(20, len(years)), histtype="step", color="blue", lw=1.5)
ax["A"].ticklabel_format(style='plain', axis='y')
ax["A"].set_title("Document Distribution Over Time")
ax["A"].set_xlabel("Year")
ax["A"].set_ylabel("Count")

# (B) Timeline (Scatter Plot)
ax["B"].scatter(years, y_values, color="blue", marker="o", lw=1.5)
ax["B"].set_title("Timeline of Documents")
ax["B"].set_xlabel("Year")
ax["B"].set_yticks([])  # Remove y-axis labels since they are not meaningful
ax["B"].grid(axis="x")

# (D) Text length per year in character counts
capped_values = [min(val, 10_000_000) for val in year_length.values()]  # limit to 10M

ax["D"].barh(list(year_length.keys()), capped_values, color="blue")
ax["D"].set_title("Text Length per Year (capped after 10M)")  
ax["D"].set_xlabel("Text Length")  

# Fix Tick Labels (Without `ticker`)
x_ticks = ax["D"].get_xticks()  
ax["D"].set_xticks(x_ticks)  
ax["D"].set_xticklabels([f"{int(x):,}" for x in x_ticks])  
ax["D"].set_xlim(0, 10_000_000)

# displaying the plots' mosaic
plt.show()