# The contours of the JISC Corpus

As the JISC corpus is not readily available to everyone, we provide a list of its titles.
This notebook explains which newspapers have been categorized as belonging to the JISC corpus and which entry in Mitchell's each of them is associated with.

In [None]:
from pathlib import Path
import pandas as pd
import pickle

## Load data

In [None]:
#!unzip ../data/Press_Directories_1846_1920_JISC_final.csv.zip -d ../data

In [None]:
path = Path('../data/Press_Directories_1846_1920_JISC_final.csv')
df = pd.read_csv(path,index_col=0)

In [None]:
jisc_meta = pd.read_excel('../data/JISC_TitleList.xlsx', sheet_name='Titles')
jisc_meta.head(2)

## Select titles

For the paper, we only looked at provincial titles (in the sense of non-metropolitan titles) that ran until at least 1846, the year in which the first edition of Mitchell's appeared.

In [None]:
jsp = jisc_meta[jisc_meta.CATEGORY.isin(['scottish','welsh','provincial','irish']) & (jisc_meta.End_year >= 1846)]
list(jsp['Newspaper Title'])
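
To make the selection rule concrete without access to the title list, here is a minimal sketch on a toy table. The column names `Newspaper Title`, `CATEGORY` and `End_year` are taken from the cell above; the rows, the titles and the `london` category value are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the notebook.
toy_meta = pd.DataFrame({
    'Newspaper Title': ['Title A', 'Title B', 'Title C', 'Title D'],
    'CATEGORY': ['provincial', 'london', 'scottish', 'welsh'],
    'End_year': [1900, 1880, 1840, 1846],
})

# Same rule as above: non-metropolitan category AND still running in 1846 or later.
keep_categories = ['scottish', 'welsh', 'provincial', 'irish']
mask = toy_meta.CATEGORY.isin(keep_categories) & (toy_meta.End_year >= 1846)

toy_selected = list(toy_meta.loc[mask, 'Newspaper Title'])
print(toy_selected)
```

In this toy table, `Title B` is dropped because of its category and `Title C` because it ended before 1846, so only `Title A` and `Title D` remain.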

Below we list the titles that were categorized as being in JISC.

In [None]:
sorted(df[df.IN_JISC > 0].TITLE.unique())

The folder `../data/jisc_links` contains annotations in which we manually labelled pairs of titles (one from JISC, one from Mitchell's) as referring either to the same newspaper (labelled "same") or to different newspapers (labelled "different"). We then extended the BL System ID of each "same" pair to all later entries with the same `NEWSPAPER_ID`, up to the end year of the JISC title. Below we create a table that allows you to compare each JISC title with the corresponding entries in Mitchell's.

In [None]:

def get_links(pickle_path):
    # The year is the last underscore-separated part of the file name.
    year = pickle_path.stem.split('_')[-1]
    with open(pickle_path, 'rb') as f:
        annotations = pickle.load(f)
    # Keep only the pairs that were labelled as the same newspaper.
    same = [a for a in annotations if a[-1] == 'same']
    links = []
    for obs, _label in same:
        # obs[1]: BL System ID, obs[2]: id of the Mitchell's entry, obs[3]: NEWSPAPER_ID
        jisc_row = jisc_meta[jisc_meta['System ID'] == obs[1]]
        jisc_title = jisc_row['Newspaper Title'].values[0]
        end_year = jisc_row['End_year'].values[0]
        mitchell_title = df[df.id == obs[2]]['TITLE'].values[0]
        # Later entries of the same newspaper, up to the end year of the JISC title.
        chain_titles = df[(df.NEWSPAPER_ID == obs[3]) & (df.YEAR > int(year))
                          & (df.YEAR <= end_year)]['TITLE'].values
        links.append(['manual', obs[1], obs[2], jisc_title, mitchell_title])
        links.extend([['newspaper_id', obs[1], obs[3], jisc_title, title] for title in chain_titles])
    return links
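
The chaining step inside `get_links` can be tried in isolation. The sketch below is an illustration on a toy table, not a rerun of the real annotations: the column names `id`, `NEWSPAPER_ID`, `YEAR` and `TITLE` come from the function above, while the rows and the values of `annotation_year`, `jisc_end_year` and `newspaper_id` are hypothetical.

```python
import pandas as pd

# Hypothetical Mitchell's entries; only the column names come from the notebook.
toy_df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'NEWSPAPER_ID': [10, 10, 10, 20],
    'YEAR': [1846, 1851, 1860, 1851],
    'TITLE': ['Toy Gazette', 'Toy Gazette and Advertiser',
              'Toy Daily Gazette', 'Other Paper'],
})

annotation_year = 1846   # assumed year of the manually annotated edition
jisc_end_year = 1855     # assumed End_year of the linked JISC title
newspaper_id = 10        # assumed NEWSPAPER_ID of the manually linked entry

# Entries of the same newspaper, later than the annotated year
# and no later than the end year of the JISC title.
in_chain = (
    (toy_df.NEWSPAPER_ID == newspaper_id)
    & (toy_df.YEAR > annotation_year)
    & (toy_df.YEAR <= jisc_end_year)
)
chain_titles = list(toy_df.loc[in_chain, 'TITLE'])
print(chain_titles)
```

Here only the 1851 entry is picked up: the 1846 entry is the annotated one itself, the 1860 entry falls after the assumed end year, and `Other Paper` belongs to a different `NEWSPAPER_ID`.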

In [None]:
annotation_files = list(Path('../data/jisc_links/').glob('*.pickle'))
links = []
for af in annotation_files:
    links.extend(get_links(af))

jisc_link_df = pd.DataFrame(links, columns=['LINKING_METHOD', 'BL_SYSTEM_ID', 'NPD_ID', 'JISC_TITLE', 'MITCHELL_TITLE'])
# sort_values returns a new DataFrame, so assign the result before saving.
jisc_link_df = jisc_link_df.sort_values(by=['BL_SYSTEM_ID'])
jisc_link_df.to_csv('../data/jisc_links.csv')

In [None]:
print('All done!')

# Fin.