This notebook creates sample data for the "timestamped corpus" visualization prototype: https://observablehq.com/@dharpa-project/timestamped-corpus.<br>
The visualization takes a table with three columns: a date, a publication name, and the corresponding document count. It lets users explore a corpus of documents by publication and see how many documents each publication contributes over time, which makes it easier, for example, to spot gaps in the data.
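
As a minimal sketch of the table shape described above, the snippet below builds two rows by hand. The values are taken from the aggregated output shown later in this notebook; the name `sample` is only illustrative. The final table produced here also carries an extra `agg` column naming the aggregation level.

```python
import pandas as pd

# Two illustrative rows with the three columns the visualization reads
sample = pd.DataFrame(
    {
        "date": ["1897-12-31", "1898-12-31"],
        "publication_name": ["L'Italia", "L'Italia"],
        "count": [282, 185],
    }
)
print(sample)
```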

In [None]:
import pandas as pd
import os
import re

## 1. Data imports set-up

This example is based on the following dataset: https://zenodo.org/record/4596345#.YguJQe7MLvW

In [41]:
! wget "https://zenodo.org/record/4596345/files/ChroniclItaly_3.0_original.zip"

--2022-02-16 16:48:19--  https://zenodo.org/record/4596345/files/ChroniclItaly_3.0_original.zip
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 96839242 (92M) [application/octet-stream]
Saving to: ‘ChroniclItaly_3.0_original.zip’


2022-02-16 16:54:30 (257 KB/s) - ‘ChroniclItaly_3.0_original.zip’ saved [96839242/96839242]



In [43]:
%%capture
!unzip ChroniclItaly_3.0_original.zip

In [45]:
folder_path = 'CI_newspaper_subcorpora/'

In [46]:
folders_list = os.listdir(folder_path)

## 2. Creation of initial dataframe with docs list and metadata

In [47]:
# retrieving all the document names from sub-folders (1 sub-folder = 1 publication)
tot_files_list = []
for folder in folders_list:
  files_list = os.listdir(os.path.join(folder_path, folder))
  for file in files_list:
    # only the file name is needed: it encodes the publication ref and the date
    tot_files_list.append(file)

In [48]:
# inserting the file names into a dataframe
sources = pd.DataFrame(tot_files_list, columns=['file_name'])

In [49]:
# get date from file name
def get_date(file):
  date_match = re.findall(r'_(\d{4}-\d{2}-\d{2})_',file)
  return date_match[0]

# get publication ref from file name
def get_ref(file):
  ref_match = re.findall(r'(\w+\d+)_\d{4}-\d{2}-\d{2}_',file)
  return ref_match[0]
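
The two helpers above raise an `IndexError` if a file name does not follow the expected pattern (for instance a stray file in a sub-folder). A more defensive variant could look like the sketch below. It assumes file names of the form `<publication ref>_<YYYY-MM-DD>_<rest>`, as seen in this dataset; `FILE_NAME_PATTERN` and `parse_file_name` are hypothetical names, not part of the notebook.

```python
import re

# Pattern assumed from the file names in this notebook:
# <publication ref>_<YYYY-MM-DD>_<rest>
FILE_NAME_PATTERN = re.compile(r'^(?P<ref>\w+?)_(?P<date>\d{4}-\d{2}-\d{2})_')

def parse_file_name(file_name):
    """Return (ref, date) for a matching file name, or (None, None) otherwise."""
    match = FILE_NAME_PATTERN.match(file_name)
    if match is None:
        return None, None
    return match.group('ref'), match.group('date')

print(parse_file_name('sn85055164_1918-09-28_ed-1_seq-1_ocr.txt'))
# ('sn85055164', '1918-09-28')
print(parse_file_name('README.txt'))
# (None, None)
```

Rows with a `None` date can then be dropped or inspected before the aggregation step.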

In [50]:
sources['date'] = sources['file_name'].apply(get_date)
sources['publication'] = sources['file_name'].apply(get_ref)

In [51]:
pub_list = sources['publication'].unique()

In [52]:
# add publication names
PUB_NAMES = {
    'sn85066408': "L'Italia",
    '2012271201': "Cronaca Sovversiva",
    'sn84020351': "La Sentinella",
    'sn85054967': "Il Patriota",
    'sn84037024': "La Ragione",
    'sn84037025': "La Rassegna",
    'sn85055164': "La Libera Parola",
    'sn86092310': "La Sentinella del West",
    'sn92051386': "La Tribuna del Connecticut",
    'sn93053873': "L'Indipendente",
}

def get_pub_name(pub_number):
    # returns None when the reference is not in the mapping
    return PUB_NAMES.get(pub_number)

In [63]:
sources['publication_name'] = sources['publication'].apply(get_pub_name)

In [64]:
sources.head()

,file_name,date,publication,publication_name
0,sn85055164_1918-09-28_ed-1_seq-1_ocr.txt,1918-09-28,sn85055164,La Libera Parola
1,sn85055164_1921-12-17_ed-1_seq-1_ocr.txt,1921-12-17,sn85055164,La Libera Parola
2,sn85055164_1919-02-15_ed-1_seq-1_ocr.txt,1919-02-15,sn85055164,La Libera Parola
3,sn85055164_1922-04-01_ed-1_seq-1_ocr.txt,1922-04-01,sn85055164,La Libera Parola
4,sn85055164_1921-05-15_ed-1_seq-1_ocr.txt,1921-05-15,sn85055164,La Libera Parola


## 3. Aggregating data

In [65]:
# keeping only the date and publication name; 'publication' is kept as the column to count during aggregation (renamed 'count' afterwards)
df = sources[['date', 'publication_name', 'publication']].copy()

In [66]:
df['date'] = pd.to_datetime(df['date'])

In [71]:
df = df.set_index('date')

In [73]:
def data_agg(df, pub_list):
  # pub_list is kept in the signature but is not needed: grouping by
  # 'publication_name' already covers every publication in a single pass
  frames = []

  # year, month and day are the aggregation levels used by the visualization
  for freq, label in [('Y', 'year'), ('M', 'month'), ('D', 'day')]:
    df_agg = df.groupby([pd.Grouper(freq=freq), 'publication_name']).count()
    df_agg['agg'] = label
    frames.append(df_agg)

  return pd.concat(frames)
  

In [74]:
df_distrib = data_agg(df,pub_list)
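
To make the grouping step more concrete, here is a small sketch on toy data. It counts documents per calendar month and publication using `to_period("M")` rather than `pd.Grouper`, which is a swapped-in alternative chosen because recent pandas releases deprecate the `'Y'` and `'M'` offset aliases in favour of `'YE'` and `'ME'`. The toy rows and the names `toy` and `monthly` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): one row per document, indexed by publication date
toy = pd.DataFrame(
    {
        "publication_name": ["L'Italia", "L'Italia", "L'Italia", "Cronaca Sovversiva"],
        "publication": ["sn85066408", "sn85066408", "sn85066408", "2012271201"],
    },
    index=pd.to_datetime(["1897-01-05", "1897-01-20", "1897-02-03", "1897-01-05"]),
)
toy.index.name = "date"

# Count documents per calendar month and publication
monthly = (
    toy.groupby([toy.index.to_period("M"), "publication_name"])
    .size()
    .rename("count")
    .reset_index()
)
print(monthly)
```

Unlike `pd.Grouper`, which labels each group with the period's end date, this variant labels groups with the period itself (for example `1897-01`), so the date column would need converting before being passed to the visualization.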

In [76]:
# cleaning up
df_distrib = df_distrib.rename(columns={"publication": "count"})
df_distrib = df_distrib.reset_index(level=['date', 'publication_name'])
df_distrib = df_distrib.drop_duplicates()

In [80]:
df_distrib.head(5)

,date,publication_name,count,agg
0,1897-12-31,L'Italia,282,year
1,1898-12-31,L'Italia,185,year
2,1899-12-31,L'Italia,32,year
3,1900-12-31,L'Italia,52,year
4,1901-12-31,L'Italia,47,year


In [81]:
# this sample dataframe contains all aggregation options at once; in the visualization, users select one of them
# outside of this prototype, the visualization would only receive data for one option (either year, month or day)
df_distrib.to_csv('df_distrib.csv', index=False)
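
Outside the prototype, where only one aggregation level is passed on, the selection could be done with a simple filter on the `agg` column. The sketch below assumes a dataframe with the same columns as `df_distrib`; `excerpt` and `select_aggregation` are hypothetical names, and the month and day counts are made-up illustrative values.

```python
import pandas as pd

# Hypothetical excerpt with the same columns as df_distrib
# (the month and day counts are invented for illustration)
excerpt = pd.DataFrame(
    {
        "date": ["1897-12-31", "1897-01-31", "1897-01-05"],
        "publication_name": ["L'Italia", "L'Italia", "L'Italia"],
        "count": [282, 30, 1],
        "agg": ["year", "month", "day"],
    }
)

def select_aggregation(df, level):
    """Keep only the rows for one aggregation level and drop the 'agg' column."""
    selected = df.loc[df["agg"] == level, ["date", "publication_name", "count"]]
    return selected.reset_index(drop=True)

yearly = select_aggregation(excerpt, "year")
print(yearly)
```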

## 4. Visualize

The CSV file generated in the previous step can then be loaded into the following visualization:<br> https://observablehq.com/@dharpa-project/timestamped-corpus<br>
A demo of the visualization is available below, with the CSV file already pre-loaded into Observable.


In [None]:
!pip install observable_jupyter

Collecting observable_jupyter
  Downloading observable_jupyter-0.1.12-py3-none-any.whl (31 kB)
Installing collected packages: observable-jupyter
Successfully installed observable-jupyter-0.1.12


In [None]:
from observable_jupyter import embed
# inputs={'source': text}

In [82]:
import json

In [93]:
# converting df to json to pass it directly to Observable iframe
# date and count needed as str for js
df_distrib['date'] = df_distrib['date'].astype('string')
df_distrib['count'] = df_distrib['count'].astype('string')
result = df_distrib.to_dict('records')

In [94]:
result[0]

{'agg': 'year',
 'count': '282',
 'date': '1897-12-31',
 'publication_name': "L'Italia"}
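
Before passing the records to the embed, it can be useful to check that they survive a JSON round trip, assuming the data ends up serialized as JSON on its way to the Observable iframe. The sketch below uses a single hand-built row matching the record shown above; `records_df`, `records`, and `payload` are illustrative names.

```python
import json
import pandas as pd

# One row matching the record shown above, with a real datetime and an int count
records_df = pd.DataFrame(
    {
        "date": pd.to_datetime(["1897-12-31"]),
        "publication_name": ["L'Italia"],
        "count": [282],
        "agg": ["year"],
    }
)

# Convert date and count to plain strings so every value is JSON-friendly
records_df["date"] = records_df["date"].dt.strftime("%Y-%m-%d")
records_df["count"] = records_df["count"].astype(str)
records = records_df.to_dict("records")

# json.dumps raises a TypeError if any value is not serializable
payload = json.dumps(records, ensure_ascii=False)
print(payload)
```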

In [97]:
embed(
    '@dharpa-project/timestamped-corpus',
    cells=['viewof colorScheme','viewof timeSpan','viewof chart','viewof scaleTypeTest','viewof axisLabelTest','style'],
    inputs={'source': result}
)