# Step02: Interface for selecting newspaper titles

This is step 02 of the **PressPicker** tool.

### Setup
Your local directory should look like:

```bash
.
├── Step02_visualisation.ipynb
├── Utils_Step02_visualisation.css.html
├── Utils_Step02_visualisation.js
├── datasets
│   ├── dynamic_io
│   │   ├── counties.csv
│   │   ├── timeseries_items_hc.csv
│   │   ├── timeseries_items_mf.csv
│   │   ├── titles.csv
│   │   ├── titles_hc.csv
│   │   └── titles_mf.csv
│   ├── previous_selections_to_exclude
│   │   ├── Jon-selection-streams4-7.xlsx
│   │   ├── Jon_hardcopySelection_Aug2019_withSublibrary.csv
│   │   ├── Jon_microfilmSelection_Aug2019.csv
│   │   └── Picklist_8_CaseStudy_reviewed.xlsx
│   └── ...
└── ...
```

If you don't have any of these files, please contact one of the developers.

---

**Note (for developers):** `timeseries_items_hc.csv` and `timeseries_items_mf.csv` are generated in the `Step01_filtering_processing_newspapers.ipynb` notebook. Refer to its 'Outputs' cell, which writes them with:

```python
timeseries_items_mf.to_csv(os.path.join(parent_path, "timeseries_items_mf.csv"))
timeseries_items_hc.to_csv(os.path.join(parent_path, "timeseries_items_hc.csv"))
```

---

### Installation
1. `pandas` can be installed via `pip`:

```bash
pip install pandas
```

Refer to https://pandas.pydata.org/pandas-docs/stable/install.html for more information.

### Import Python modules, JavaScript libraries and external code files

In [None]:
import numpy as np
from IPython.display import display, Javascript, HTML, Image
import os
import pandas as pd
import ipywidgets as widgets
from ipywidgets import Layout

# show all columns/rows of the dataframes
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.options.mode.chained_assignment = None  # default='warn'

display(HTML("<style>.container { width:100% !important; }</style>"))

Load in the D3.js visualisation library and the visualisation .js and .html files:

In [None]:
%%javascript
require.config({
    paths: {
        d3: 'https://d3js.org/d3.v5.min'
    }
});

In [None]:
display(HTML(filename="Utils_Step02_visualisation.css.html"))
Javascript(filename='Utils_Step02_visualisation.js')

### Import datasets: `titles.csv`, `timeseries_items_mf.csv`, `timeseries_items_hc.csv`, `counties.csv`, `titles_hc.csv`, `titles_mf.csv`

In [None]:
# Paths to data
path2datasets = "datasets"
path2dynamicData = os.path.join(path2datasets, "dynamic_io")
path2previousSelections = os.path.join(path2datasets, "previous_selections_to_exclude")

In [None]:
# read in titles.csv, hc and mf timeseries csvs, and counties csv
titles = pd.read_csv(os.path.join(path2dynamicData, "titles.csv"), dtype=str, index_col=0)

timeseries_mf = pd.read_csv(os.path.join(path2dynamicData, "timeseries_items_mf.csv"), dtype=str, index_col=0)
timeseries_hc = pd.read_csv(os.path.join(path2dynamicData, "timeseries_items_hc.csv"), dtype=str, index_col=0)
timeseries_mf.index = timeseries_mf.index.to_series().apply(lambda x: '{0:0>9}'.format(x))
timeseries_hc.index = timeseries_hc.index.to_series().apply(lambda x: '{0:0>9}'.format(x))

records_hc = pd.read_csv(os.path.join(path2dynamicData, "titles_hc.csv"), dtype=str, index_col=0)
records_hc = records_hc[['Title.ID','Publication title', 'ITEM or VIT', 'Barcode', 'Item Status', 'Chron I','Chron J','Chron K','Enum A','Enum B','Enum C','sublibrary']]
# order by date
records_hc = records_hc.sort_values(by=['Chron I'])
records_mf = pd.read_csv(os.path.join(path2dynamicData, "titles_mf.csv"), dtype=str, index_col=0)
records_mf = records_mf[['Title.ID', 'Publication title','edition','locale','canNo','startReel','endReel','startDate','endDate','duplicate','LastModifiedOn','NewspaperItemID','TitleItemID','HoldingItemID']]
records_mf = records_mf.sort_values(by=['startDate'])

print("Loaded in %s titles" %len(titles))
print("%s mf timeseries" %len(timeseries_mf))
print("%s hc timeseries" %len(timeseries_hc))
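
The zero-padding applied to the timeseries indexes above can be tried in isolation. The following is a minimal sketch, assuming Title.IDs are meant to be 9-character strings; `pad_title_id` is a hypothetical helper and not part of the notebook.

```python
def pad_title_id(raw_id, width=9):
    """Left-pad an ID with zeros to a fixed width (hypothetical helper)."""
    # Same format spec as the notebook's '{0:0>9}', with the width as a parameter
    return "{0:0>{1}}".format(raw_id, width)


# IDs that lost their leading zeros, one that is already padded, and an integer
ids = ["13921360", "000000042", 7]
padded = [pad_title_id(i) for i in ids]
print(padded)
```

IDs that are already 9 characters long pass through unchanged, so applying the padding twice should be harmless.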

In [None]:
# Load in the titles to corrected counties dataset
if os.path.isfile(os.path.join(path2dynamicData, "counties.csv")):
    titles_counties = pd.read_csv(os.path.join(path2dynamicData, "counties.csv"), dtype=str, index_col=0)
else:
    print(f"[ERROR] cannot find {os.path.join(path2dynamicData, 'counties.csv')}. Have you run 'Preprocess_county_dataset' notebook?")

### Exclude prior selections

In [None]:
# read in Jon's previous selections
# WARNING - make sure the Title.IDs in the csv begin with a 0, or they will not be excluded
titles_mf_alreadyChosen_1 = pd.read_csv(os.path.join(path2previousSelections, "Jon_microfilmSelection_Aug2019.csv"), dtype=str)
titles_hc_alreadyChosen_1 = pd.read_csv(os.path.join(path2previousSelections, "Jon_hardcopySelection_Aug2019_withSublibrary.csv"), dtype=str)
titles_alreadyChosen_2 = pd.read_excel(os.path.join(path2previousSelections, "Jon-selection-streams4-7.xlsx"), sheet_name='Sheet1', dtype=str)
# Add leading zeros back on to Title.IDs
titles_alreadyChosen_2['Title.ID'] = titles_alreadyChosen_2['Title.ID'].apply(lambda x: '{0:0>9}'.format(x))

titles_alreadyChosen_3 = pd.read_excel(os.path.join(path2previousSelections, "Picklist_8_CaseStudy_reviewed.xlsx"), sheet_name='Overview', dtype=str)
titles_alreadyChosen_3['Title ID'] = titles_alreadyChosen_3['Title ID'].apply(lambda x: '{0:0>9}'.format(x))
# Rename id column Title.ID
titles_alreadyChosen_3.rename(columns = {'Title ID':'Title.ID'}, inplace = True)

# Concatenate all selections
titles_alreadyChosen = pd.concat(
    [titles_mf_alreadyChosen_1, titles_hc_alreadyChosen_1, titles_alreadyChosen_2, titles_alreadyChosen_3],
    sort=False)

# Create list of unique Title.IDs to exclude
titles_alreadyChosen_unique_ids = titles_alreadyChosen['Title.ID'].unique()
print("Exclude %s titles (from previous selections)" % len(titles_alreadyChosen_unique_ids))

In [None]:
# Exclude Jon's selections from titles, timeseries_mf, timeseries_hc
titles = titles[~titles['Title.ID'].isin(titles_alreadyChosen_unique_ids)]
timeseries_mf = timeseries_mf[~timeseries_mf.index.isin(titles_alreadyChosen_unique_ids)]
timeseries_hc = timeseries_hc[~timeseries_hc.index.isin(titles_alreadyChosen_unique_ids)]
print("Remaining %s titles" % len(titles))
print("%s mf timeseries" % len(timeseries_mf))
print("%s hc timeseries" % len(timeseries_hc))
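
The exclusion step can be checked on toy data. This is a sketch only: the frames and IDs below are invented for illustration. It also shows why the warning about leading zeros matters, because an unpadded ID does not match its padded counterpart.

```python
import pandas as pd

# Toy stand-in for the titles dataframe (invented data)
titles_demo = pd.DataFrame({
    "Title.ID": ["000000001", "000000002", "000000003"],
    "Publication title": ["Paper A", "Paper B", "Paper C"],
})

# Two toy "previous selection" tables; the second holds an unpadded ID ("3")
previous_a = pd.DataFrame({"Title.ID": ["000000002"]})
previous_b = pd.DataFrame({"Title.ID": ["000000002", "3"]})

# Concatenate the selections and keep each Title.ID once
already_chosen = pd.concat([previous_a, previous_b], sort=False)["Title.ID"].unique()

# Keep only the titles whose ID is not in the exclusion list
remaining = titles_demo[~titles_demo["Title.ID"].isin(already_chosen)]
print(remaining)
```

Here `000000003` is not excluded, because `"3"` and `"000000003"` are different strings.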

### County dataset processing

In [None]:
# filter county dataset to only those in 'titles'
titles_counties = titles_counties[titles_counties['Title.ID'].isin(titles['Title.ID'].unique())]

In [None]:
county_ids = titles_counties.groupby('corrected_county')["Title.ID"].apply(list)
county_sum = titles_counties.groupby(['corrected_county']).agg({'Title.ID': 'count'})
county_sum = county_sum.rename(columns={"Title.ID": "count"})
county_merge = pd.merge(county_sum, county_ids, left_index=True, right_index=True)

In [None]:
print(('total newspapers in counties dataset = %s') % county_merge['count'].sum())    

In [None]:
# convert to dictionary 
county_merge_dict = county_merge.to_dict(orient ="index")

In [None]:
counties_forWidget = {}
for key, value in county_merge_dict.items():
    counties_forWidget[key + ' - ' + str(value['count'])] = value['Title.ID']
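
The county cells above build a mapping from a label of the form `county - count` to the list of Title.IDs in that county. The sketch below does the same in fewer steps on invented data. It assumes every row has a Title.ID, so that the length of each list equals the count.

```python
import pandas as pd

# Toy stand-in for the counties dataset (invented data)
counties_demo = pd.DataFrame({
    "Title.ID": ["000000001", "000000002", "000000003"],
    "corrected_county": ["Kent", "Kent", "Devon"],
})

# One list of Title.IDs per county
ids_by_county = counties_demo.groupby("corrected_county")["Title.ID"].apply(list)

# Label each county with its number of titles, as the widget options do
options_demo = {
    f"{county} - {len(ids)}": ids
    for county, ids in ids_by_county.items()
}
print(options_demo)
```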

# Choose county
Select one or more counties from the menu produced by the following cell. To select several counties, hold shift and/or ctrl (or command) while clicking or using the arrow keys.

You can come back and select different counties to visualise at any time, but you must then re-run the 'Filter data by county/ies' cell below as well.

In [None]:
w = widgets.SelectMultiple(
    options=counties_forWidget,
    description='County:',
    disabled=False,
    layout=Layout(width='500px', height='200px')
)
display(w)
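
Because each option maps to a list of Title.IDs, the selection has to be flattened before it can be used as a filter. The sketch below assumes that `w.value` is a tuple holding one list per selected county, as the flattening code in the next cell implies. The tuple here is a hand-written stand-in, not real widget output.

```python
from itertools import chain

# Stand-in for w.value with two counties selected (invented IDs)
selected_value = (["000000001", "000000002"], ["000000003"])

# Flatten the per-county lists into one list of Title.IDs
ids_flat = list(chain.from_iterable(selected_value))
print(ids_flat)
```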

### Filter data by county/ies

In [None]:
# access IDs for selected county and convert into a list
ids_for_county = list(w.value)
ids_for_county = [j for i in ids_for_county for j in i]

# filter titles, timeseries_mf, timeseries_hc by county
titles_countyFilter = titles[titles['Title.ID'].isin(ids_for_county)]
timeseries_mf_countyFilter = timeseries_mf[timeseries_mf.index.isin(ids_for_county)]
timeseries_hc_countyFilter = timeseries_hc[timeseries_hc.index.isin(ids_for_county)]
print('filtered to %s titles' % len(titles_countyFilter))

# save datasets in json format for visualising
titles_countyFilter.to_json(os.path.join(path2dynamicData, r'titles.json'), orient="records")
timeseries_mf_countyFilter.to_json(os.path.join(path2dynamicData, r'timeseries_items_mf.json'),orient="index")
timeseries_hc_countyFilter.to_json(os.path.join(path2dynamicData, r'timeseries_items_hc.json'),orient="index")
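
The two `orient` settings used above produce different JSON shapes: `records` gives a list of row objects, while `index` gives an object keyed by the index (here the Title.ID). A small sketch on invented data follows. The year-like column names are an assumption for illustration, not taken from the real timeseries files.

```python
import json

import pandas as pd

# Toy titles table and toy timeseries indexed by Title.ID (invented data)
titles_demo = pd.DataFrame({
    "Title.ID": ["000000001"],
    "Publication title": ["Example Gazette"],
})
timeseries_demo = pd.DataFrame(
    {"1850": ["12"], "1851": ["0"]},  # hypothetical year columns
    index=["000000001"],
)

# With no path given, to_json returns the JSON as a string
titles_json = json.loads(titles_demo.to_json(orient="records"))
timeseries_json = json.loads(timeseries_demo.to_json(orient="index"))

print(titles_json)
print(timeseries_json)
```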

# Visualise

In [None]:
Javascript("""
(function(element){
    require(['newspaper_viz'], function(newspaper_viz) {
        newspaper_viz(element.get(0))
    });
})(element);
""")

### Uncheck boxes
Run the cell below to uncheck all checkboxes. You can run it as many times as you like:

In [None]:
# Uncheck all check boxes
Javascript("""
var checkboxes = new Array(); 
var checktoggle = false;
checkboxes = document.getElementsByClassName("title_checkBox");
for (var i=0; i<checkboxes.length; i++)  {
    if (checkboxes[i].type == 'checkbox')   {
      checkboxes[i].checked = checktoggle;
    }
}
""")

### List selected microfilms

The following cell lists the microfilm records for the selected titles.

In [None]:
try:
    selected_IDs_list
except NameError:
    print("[WARNING] selected_IDs_list is not defined. Have you selected any titles?")

In [None]:
if isinstance(selected_IDs_list, str):
    selected_IDs_list = selected_IDs_list.split(",")
selected_titles_microfilm = records_mf[records_mf['Title.ID'].isin(selected_IDs_list)]
print("#Selected unique titles: %s" % len(selected_IDs_list))
print("#Selected microfilms: {}".format(selected_titles_microfilm.shape[0]))
selected_titles_microfilm
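
The cell above accepts `selected_IDs_list` either as a list or as a comma-separated string. A slightly more defensive version of that conversion might look like the sketch below. `normalise_ids` is a hypothetical helper, and stripping whitespace and dropping empty entries are additions that the notebook itself does not make.

```python
def normalise_ids(selection):
    """Return a clean list of IDs from a list or a comma-separated string."""
    if isinstance(selection, str):
        selection = selection.split(",")
    # Strip surrounding whitespace and drop empty entries
    return [item.strip() for item in selection if item.strip()]


print(normalise_ids("000000001, 000000002"))
print(normalise_ids(["000000003"]))
```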

### List selected hardcopies

The following cell lists the hardcopy records for the selected titles.

In [None]:
# Get the hardcopy records for the selected IDs
if isinstance(selected_IDs_list, str):
    selected_IDs_list = selected_IDs_list.split(",")

selected_titles_hardcopy = records_hc[records_hc['Title.ID'].isin(selected_IDs_list)]
print("#Selected hardcopies: {}".format(selected_titles_hardcopy.shape[0]))
selected_titles_hardcopy

### OUTPUT

If you are happy with the selected titles, set `output_id` to the label you want in the output filenames and run the next cell.
This creates two CSV files in the `datasets/selections` directory, one for the selected microfilms (`<output_id>_microfilm.csv`) and one for the selected hardcopies (`<output_id>_hardcopy.csv`).

In [None]:
# choose an ID for your selected titles
output_id = "2019-03-16_MY_LABEL"
path2outputs = os.path.join("datasets", "selections")
if not os.path.isdir(path2outputs):
    os.makedirs(path2outputs)

# Ensure Title.ID field is type str, or .to_csv will remove leading 0s
selected_titles_microfilm['Title.ID'] = selected_titles_microfilm['Title.ID'].astype('str')
selected_titles_hardcopy['Title.ID'] = selected_titles_hardcopy['Title.ID'].astype('str')

selected_titles_microfilm.to_csv(os.path.join(path2outputs, "{}_microfilm.csv".format(output_id)), index=False)
selected_titles_hardcopy.to_csv(os.path.join(path2outputs, "{}_hardcopy.csv".format(output_id)), index=False)
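
When the output files are read back later, the leading zeros of `Title.ID` survive only if the column is read as text. The sketch below illustrates this with an in-memory buffer instead of a real file, using invented data.

```python
import io

import pandas as pd

# Toy selection with a zero-padded Title.ID (invented data)
selection_demo = pd.DataFrame({
    "Title.ID": ["013921360"],
    "Publication title": ["Example Gazette"],
})

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
selection_demo.to_csv(buffer, index=False)

# Default type inference parses the ID as an integer and drops the zero
buffer.seek(0)
as_inferred = pd.read_csv(buffer)

# Reading with dtype=str keeps the ID exactly as written
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype=str)

print(as_inferred["Title.ID"].iloc[0])
print(as_text["Title.ID"].iloc[0])
```

This matches how the notebook loads its own inputs, which all pass `dtype=str` to `pd.read_csv`.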