# Scraping the KB APIs for newspapers
Goal: Get links to all JP2 files. They are contained in "dark-packages", so we need to fetch those first.

Results: This code is used to produce (aftonbladet|svd|dn).csv

TODO: After KB's SSL certificate appeared to expire, this code won't work until the `requests` calls are rewritten like this:
```python
# Certificate verification had to be disabled; something strange is going on with the certificate.
requests.packages.urllib3.disable_warnings()
response = requests.get(url, headers=headers, verify=False)
```

#### GET a known dark-package with all its files (Norrköpings W)
```bash
curl -X 'GET' \
  'https://data.kb.se/dark-8428195' \
  -H 'accept: application/json'

```
#### GET Aftonbladet whole dataset
```bash
curl -X 'GET' \
  'https://data.kb.se/dataset/7mf87n2g5xzjf8r4' \
  -H 'accept: application/json'

```

This gets the metadata record for the dataset. It is unusable for our purpose, however, as it doesn't contain any links to dark-packages, which are the API objects that hold the actual links to image files. For an alternative approach using the search API, see below.

#### Norrköpings Weckobladh

NOTE: The pages in the dataset browser are not sorted chronologically.

First page url, offset=0: https://data.kb.se/dataset/m1tnm4cpktxgf28n?offset=0

Final page url, offset=9990: https://data.kb.se/dataset/m1tnm4cpktxgf28n?offset=9990

Number of dark-packages: 9990+10(?)

NOTE: It seems like a very strange coincidence that the total number of dark-packages should be exactly 10000. One dark-package corresponds to one issue of the newspaper.

#### Aftonbladet

First page url, offset=0: https://data.kb.se/dataset/7mf87n2g5xzjf8r4?offset=0

Final page url, offset=9990: https://data.kb.se/dataset/7mf87n2g5xzjf8r4?offset=9990

NOTE: Yep. Aftonbladet cuts off at 10000 too. That is not a coincidence but an arbitrary limit. The full datasets have not been made available.
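
The paging scheme described above can be sketched as a small generator. This is only an illustration: the page size of 10 is inferred from the final offset of 9990 and the 10000 cutoff, not confirmed by KB, and `dataset_page_urls` is a hypothetical helper name.

```python
def dataset_page_urls(dataset_id, total=10000, page_size=10):
    """Yield dataset-browser URLs for offsets 0, page_size, ... below total.

    Assumption: each page lists `page_size` dark-packages and the browser
    stops at `total` entries, as observed for the two datasets above.
    """
    base = f"https://data.kb.se/dataset/{dataset_id}"
    for offset in range(0, total, page_size):
        yield f"{base}?offset={offset}"


urls = list(dataset_page_urls("7mf87n2g5xzjf8r4"))  # Aftonbladet
print(len(urls))
print(urls[0])
print(urls[-1])
```

Since the pages are not sorted chronologically, walking them like this would only enumerate the first 10000 entries in whatever order the browser uses.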


#### Get metadata containing 10000 (arbitrary maximum) dark-package objects

```bash
curl -X 'GET' \
  'https://data.kb.se/search/?q=%2A&%40type=package&inDataset=https%3A%2F%2Fdata.kb.se%2Fdataset%2F7mf87n2g5xzjf8r4&searchGranularity=package&_sort=title&limit=10000&offset=0' \
  -H 'accept: application/json'
```

This gets "all" 10000 dark-packages in one JSON object. Of course, this is not the whole dataset.
The response contains "total": 10000, which confirms that KB has put a hard limit of 10k on queries. Unbelievable. The API/web interface was already completely unusable in practice, but this makes it unusable in theory as well.
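
A minimal sketch of how a script could detect that a result set has hit this cap, using only the `total` and `hits` fields mentioned above. The function name `is_truncated` and the small sample responses are hypothetical, and the value 42 is made up for illustration.

```python
def is_truncated(search_response, hard_limit=10000):
    """Return True if a parsed search response appears to have hit the cap.

    Assumption: the response is a dict with a "total" count and a "hits"
    list, as seen in the search API output described above.
    """
    total = search_response.get("total", 0)
    hits = search_response.get("hits", [])
    return total >= hard_limit or len(hits) >= hard_limit


# Stand-ins for parsed JSON responses; the hits lists are shortened.
capped = {"total": 10000, "hits": [{"@id": "dark-3702356"}]}
small = {"total": 42, "hits": [{"@id": "dark-3675680"}]}

print(is_truncated(capped))
print(is_truncated(small))
```

A check like this cannot recover the missing issues; it only makes the truncation explicit so that a CSV built from the response is not mistaken for the full dataset.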

#### IDs for all newspaper datasets

```yaml
"@id": "https://data.kb.se/dataset/m1tnm4cpktxgf28n"
"@type": "Dataset"
"title": "Norrköpings tidningar 1787-1895"

"@id": "https://data.kb.se/dataset/6ld76n1p44s12ht1"
"@type": "Dataset"
"title": "Dagens nyheter"

"@id": "https://data.kb.se/dataset/7mf87n2g5xzjf8r4"
"@type": "Dataset"
"title": "Aftonbladet"

"@id": "https://data.kb.se/dataset/hwpjh1n7fnsskvfr"
"@type": "Dataset"
"title": "Norrköpings Weko-tidningar 1758-1786"

"@id": "https://data.kb.se/dataset/hwpjh11gfnk12r2z"
"@type": "Dataset"
"title": "Svenska dagbladet"

"@id": "https://data.kb.se/dataset/l0sqlrn8jcp5svln"
"@type": "Dataset"
"title": "Post och inrikes tidningar 1645-1705"
```


# 1. First experiments and results
Results: we could request all 10K dark-packages through the API and extract the relevant information. This information was put into a dataframe and saved as aftonbladet.csv. The next steps use this CSV file instead of API calls as their starting point.



In [2]:
import requests
import json
import urllib.parse as ul
import pandas as pd
from hurry.filesize import size


This code sends a request to the API to get all dark-packages for a given dataset ID (here DN; the IDs for Aftonbladet and SvD are commented out).

In [5]:
headers = {'accept': 'application/json'}
q = '*'
#inDataset = 'https://data.kb.se/dataset/7mf87n2g5xzjf8r4' #aftonbladet
#inDataset = 'https://data.kb.se/dataset/hwpjh11gfnk12r2z' #SvD
inDataset = 'https://data.kb.se/dataset/6ld76n1p44s12ht1' #DN 1864-1893
limit = 20000
offset = 0
url = f"https://data.kb.se/search/?q={ul.quote(q)}&type=package&inDataset={ul.quote(inDataset)}&searchGranularity=package&_sort=title&limit={str(limit)}&offset={str(offset)}"
print(url)
# This gets a dataset, but it is useless as the dataset contains no open references to dark-packages. These probably need to be scraped, incredibly...
#url = "https://data.kb.se/dataset/7mf87n2g5xzjf8r4"
response = requests.get(url, headers=headers)


https://data.kb.se/search/?q=%2A&type=package&inDataset=https%3A//data.kb.se/dataset/6ld76n1p44s12ht1&searchGranularity=package&_sort=title&limit=20000&offset=0
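
Instead of commenting dataset lines in and out, the URL construction could be parametrized. The following is a sketch under a few assumptions: the short keys in `DATASETS` and the helper `build_search_url` are hypothetical names, and the query uses the `@type` parameter from the curl example earlier, whereas the cell above sends `type`.

```python
from urllib.parse import urlencode

# Dataset IDs copied from the list of newspaper datasets; keys are just labels.
DATASETS = {
    "aftonbladet": "7mf87n2g5xzjf8r4",
    "svd": "hwpjh11gfnk12r2z",
    "dn": "6ld76n1p44s12ht1",
}


def build_search_url(name, limit=10000, offset=0):
    """Build a search URL listing the dark-packages of one dataset."""
    params = {
        "q": "*",
        "@type": "package",
        "inDataset": f"https://data.kb.se/dataset/{DATASETS[name]}",
        "searchGranularity": "package",
        "_sort": "title",
        "limit": limit,
        "offset": offset,
    }
    # urlencode percent-encodes every value, including "/" and ":".
    return "https://data.kb.se/search/?" + urlencode(params)


name = "dn"
url = build_search_url(name)
csv_file = f"{name}.csv"
print(url)
print(csv_file)
```

Keeping the dataset key in one variable also ties the output filename to the dataset, so the URL and the CSV name cannot drift apart.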


We got the HTTP response. Now parse it and populate dark_list with one dark_dict per dark-package.

In [6]:
json_response = json.loads(response.text)

print("Length: " + str(len(json_response)))

l = json_response["hits"]

number_of_dark_packages = len(l)
print("Length 'hits': " + str(number_of_dark_packages))

dark_dict = {"id": "", 
             "id_url": "",
             "title": "",
             "date_published": ""}
dark_list = []
ids = []
id_urls = []
titles = []
dates_published = []


# Loop over the hits and collect the fields we need.
# `dark_id` is used instead of `id` to avoid shadowing the built-in.
for hit in l:
    dark_id = hit["@id"]
    id_url = "https://data.kb.se/" + dark_id
    title = hit["title"]
    date_published = hit["datePublished"]

    dark_dict['id'] = dark_id
    dark_dict['id_url'] = id_url
    dark_dict['date_published'] = date_published
    dark_dict['title'] = title

    dark_list.append(dark_dict.copy())

    ids.append(dark_id)
    id_urls.append(id_url)
    titles.append(title)
    dates_published.append(date_published)

    # print("id = " + dark_id)
    # print("id_url = " + id_url)
    # print("title = " + title)
    # print("date_published = " + date_published)

#print(dark_list)
count=len(dark_list)
print(f"Added {count} dark-packages to list.")

Length: 5
Length 'hits': 10000
Added 10000 dark-packages to list.
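
The same extraction can be written more compactly as a function that returns one flat dict per hit. This is a sketch, not the notebook's actual code: `hits_to_records` is a hypothetical name, the sample hit reuses values shown in this notebook, and the plain `YYYY-MM-DD` form of `datePublished` is an assumption.

```python
def hits_to_records(hits):
    """Turn the "hits" list of a search response into flat dicts."""
    return [
        {
            "id": hit["@id"],
            "id_url": "https://data.kb.se/" + hit["@id"],
            # .get() keeps the loop alive if a hit lacks an optional field.
            "title": hit.get("title", ""),
            "date_published": hit.get("datePublished", ""),
        }
        for hit in hits
    ]


# One sample hit; the date format is assumed.
sample_hits = [
    {
        "@id": "dark-3702356",
        "title": "DAGENS NYHETER  1893-03-24",
        "datePublished": "1893-03-24",
    },
]

records = hits_to_records(sample_hits)
print(records[0]["id_url"])
```

A list of dicts like `records` can be passed directly to `pd.DataFrame`, which would make the four parallel lists unnecessary.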


We can inspect which years are included:

In [None]:
#print(dark_list)

# for i in range(len(dark_list)):
#     print()
print(dates_published)
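
Printing 10000 dates is hard to read. A per-year count gives a quicker overview of the coverage. The sketch below assumes the dates are strings that start with a four-digit year, as in the titles shown in this notebook; `issues_per_year` is a hypothetical helper and the sample list is a tiny stand-in for `dates_published`.

```python
from collections import Counter


def issues_per_year(dates):
    """Count issues per year from date strings that start with YYYY."""
    return Counter(d[:4] for d in dates if d)


# Small stand-in for dates_published, using dates that appear in this notebook.
sample_dates = ["1864-12-23", "1865-01-02", "1865-01-03", "1893-03-24"]

counts = issues_per_year(sample_dates)
for year in sorted(counts):
    print(year, counts[year])
```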

Then we build a dataframe from the collected lists and write a CSV file. This is how we create the CSV files used in the later notebooks for generating download scripts.

In [8]:
#csv_file='aftonbladet.csv'
#csv_file='svd.csv'
csv_file = 'dn.csv'

# Named `data` rather than `dict` to avoid shadowing the built-in type.
data = {'id': ids, 'id_url': id_urls, 'date_published': dates_published, 'title': titles}
df = pd.DataFrame(data)

df.head()
df.to_csv(csv_file)



# 2. Continue parsing from csv

Run some tests and extract some information from the CSV file.

In [9]:
df = pd.read_csv('dn.csv', index_col=0, parse_dates=['date_published'])
df.head()
df.tail()

# pandas recognizes the timestamp correctly
# Use the last row of the dataframe directly, so this cell does not depend on section 1.
last_row = df.iloc[-1]
print (last_row)
print(last_row['id'])
print (type(last_row['date_published'])) 

id                                   dark-3702356
id_url            https://data.kb.se/dark-3702356
date_published                1893-03-24 00:00:00
title                  DAGENS NYHETER  1893-03-24
Name: 9999, dtype: object
dark-3702356
<class 'pandas._libs.tslibs.timestamps.Timestamp'>


Run a test on a single row, then on one specific dark-package, to find out which information to extract and how. This code should then be rolled into a loop.

In [10]:
headers = {'accept': 'application/json'}
dark_id = last_row['id']
url = f'https://data.kb.se/{dark_id}'
print(url)
# Get dark-package:
response = requests.get(url, headers=headers)
#print(response.text)

json_response = json.loads(response.text)
#print(type(response))
#print(type(json_response))
#print(type(json_response['includes']))
l = json_response['includes']

# Explorative code that has been rolled into the loop/functions below
# print("Length of 'includes': "+str(len(l)))
# fileurl = l[0]['@id']
# filename = l[0]['fileName']
# contentsize = l[0]['contentSize']
# md5 = l[0]['checksum']['value']
# print(f'------ /CDHU/ Single file (idx:0) request from {dark_id}: -------')
# print(f'fileurl: {fileurl}')
# print(f'filename: {filename}')
# print(f'contentSize: {size(contentsize)}')
# print(f'md5: {md5}')

# This function gets items and data from dark-package:
def get_dark_package_items (l,i):
    fileurl = l[i]['@id']
    filename = l[i]['fileName']
    contentsize = l[i]['contentSize']
    md5 = l[i]['checksum']['value']
    return fileurl,filename,contentsize,md5

def print_dark_package_items(l):
    # This loop gets relevant items and metadata contained in the dark-package   
    for i in range(len(l)):
        fileurl, filename, contentsize, md5 = get_dark_package_items(l,i)
        print(f'------ /CDHU/ -------')
        print(f'fileurl: {fileurl}')
        print(f'filename: {filename}')
        print(f'contentSize: {size(contentsize)}')
        print(f'md5: {md5}')
        
print_dark_package_items(l)


https://data.kb.se/dark-3702356
------ /CDHU/ -------
fileurl: https://data.kb.se/dark-3702356/bib13991099_18930324_0_8588b_0001.jp2
filename: bib13991099_18930324_0_8588b_0001.jp2
contentSize: 25M
md5: dfc0093df9c1ec482c8dca0494164e00
------ /CDHU/ -------
fileurl: https://data.kb.se/dark-3702356/bib13991099_18930324_0_8588b_0002.jp2
filename: bib13991099_18930324_0_8588b_0002.jp2
contentSize: 24M
md5: 6c73b186d47401a9bf626c1fcfa9a237
------ /CDHU/ -------
fileurl: https://data.kb.se/dark-3702356/bib13991099_18930324_0_8588b_0001_alto.xml
filename: bib13991099_18930324_0_8588b_0001_alto.xml
contentSize: 1M
md5: 269dd5a062a82db5a15c9b702d29a01a
------ /CDHU/ -------
fileurl: https://data.kb.se/dark-3702356/bib13991099_18930324_0_8588b_0002_alto.xml
filename: bib13991099_18930324_0_8588b_0002_alto.xml
contentSize: 1M
md5: 8211b4f113d8b556253b1ff42c9f2153
------ /CDHU/ -------
fileurl: https://data.kb.se/dark-3702356/bib13991099_18930324_0_8588b_performance.xml
filename: bib13991099_1893
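
Since the goal is the JP2 files, the `includes` list can be filtered by filename. A minimal sketch, assuming each item carries the `@id`, `fileName` and `checksum` fields used above; `jp2_files` is a hypothetical helper and the sample items reuse values from the output above, with `contentSize` left out.

```python
def jp2_files(includes):
    """Return (url, md5) pairs for the .jp2 items of one dark-package."""
    return [
        (item["@id"], item["checksum"]["value"])
        for item in includes
        if item["fileName"].lower().endswith(".jp2")
    ]


# Two items copied from the output above: one image and one ALTO file.
sample_includes = [
    {
        "@id": "https://data.kb.se/dark-3702356/bib13991099_18930324_0_8588b_0001.jp2",
        "fileName": "bib13991099_18930324_0_8588b_0001.jp2",
        "checksum": {"value": "dfc0093df9c1ec482c8dca0494164e00"},
    },
    {
        "@id": "https://data.kb.se/dark-3702356/bib13991099_18930324_0_8588b_0001_alto.xml",
        "fileName": "bib13991099_18930324_0_8588b_0001_alto.xml",
        "checksum": {"value": "269dd5a062a82db5a15c9b702d29a01a"},
    },
]

for url, md5 in jp2_files(sample_includes):
    print(md5, url)
```

The same pattern with a different suffix would select the ALTO XML files instead.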

Test looping over several dark-packages:


In [11]:
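print("Total issues (dark-packages):", len(df.index))
length = len(df.index)
# Cap the number of issues fetched while testing.
LIMIT = 10
length = LIMIT
# Note: this prints the total number of rows, not the capped `length` that the loop uses.
print("Issues to parse:", len(df.index))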

issue_list = []
for i in range(length):
    row = df.iloc[i]
    dark_id = row['id']
    title = row['title']
    # print(dark_id)
    headers = {'accept': 'application/json'}
    url = f'https://data.kb.se/{dark_id}'
    print(f'(idx: {i} / title: {title})', end='')
    print(url)
    # Get dark-package:
    response = requests.get(url, headers=headers)
    # print(response.text)

    json_response = json.loads(response.text)
    l = json_response['includes']

    # Collect one dict per file in this dark-package.
    # The inner loop uses its own variable so the outer index i is not shadowed.
    issue = []
    for item in l:
        item_dict = {'title': title,
                     'fileurl': item['@id'],
                     'filename': item['fileName'],
                     'contentSize': item['contentSize'],
                     'md5': item['checksum']['value'],
                     'dark_id': dark_id
                     }
        issue.append(item_dict)
    # print(issue)
    issue_list.append(issue)
###
# NEXT: HOW TO BEST REPRESENT THIS AS DATA IN THE NEXT STEP?
###

print("Length of issue_list:",len(issue_list))
print(issue_list[0][0]['fileurl'])
print(issue_list[0][1]['fileurl'])
print(issue_list[0][2]['fileurl'])
print(issue_list[0][3]['fileurl'])
print(issue_list[0][4]['fileurl'])

# dark_dict = {"id": "",
#              "id_url": "",
#              "title": "",
#              "date_published": ""}
# dark_list = []
# ids = []
# id_urls = []
# titles = []
# dates_published = []

#
# !!REMEMBER CHECK OUT THIS: Convert alto to ocr txt! https://github.com/cneud/alto-ocr-text . DONE.
#

Total issues (dark-packages): 10000
Issues to parse: 10000
(idx: 0 / title: DAGENS NYHETER  1864-12-23)https://data.kb.se/dark-3675680
(idx: 1 / title: DAGENS NYHETER  1865-01-02)https://data.kb.se/dark-3675682
(idx: 2 / title: DAGENS NYHETER  1865-01-03)https://data.kb.se/dark-3675683
(idx: 3 / title: DAGENS NYHETER  1865-01-04)https://data.kb.se/dark-3675684
(idx: 4 / title: DAGENS NYHETER  1865-01-05)https://data.kb.se/dark-3675685
(idx: 5 / title: DAGENS NYHETER  1865-01-07)https://data.kb.se/dark-3675687
(idx: 6 / title: DAGENS NYHETER  1865-01-09)https://data.kb.se/dark-3675686
(idx: 7 / title: DAGENS NYHETER  1865-01-10)https://data.kb.se/dark-3675688
(idx: 8 / title: DAGENS NYHETER  1865-01-11)https://data.kb.se/dark-3675689
(idx: 9 / title: DAGENS NYHETER  1865-01-12)https://data.kb.se/dark-3675690
Length of issue_list: 10
https://data.kb.se/dark-3675680/bib13991099_18641223_0_1_0001.jp2
https://data.kb.se/dark-3675680/bib13991099_18641223_0_1_0002.jp2
https://data.kb.se/dark-
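
One possible answer to the open question in the cell above is to flatten `issue_list` into one row per file, which maps directly onto a CSV file or a dataframe. This is only a sketch of that idea: `flatten_issues` and `rows_to_csv` are hypothetical helpers, and the sample rows carry just three of the fields collected above.

```python
import csv
import io


def flatten_issues(issue_list):
    """Flatten a list of issues (each a list of file dicts) into one list."""
    return [item for issue in issue_list for item in issue]


def rows_to_csv(rows, fieldnames):
    """Serialize rows to CSV text; each row must have exactly these keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# One issue with two files, using values from the output above.
sample_issue_list = [[
    {"dark_id": "dark-3675680",
     "title": "DAGENS NYHETER  1864-12-23",
     "fileurl": "https://data.kb.se/dark-3675680/bib13991099_18641223_0_1_0001.jp2"},
    {"dark_id": "dark-3675680",
     "title": "DAGENS NYHETER  1864-12-23",
     "fileurl": "https://data.kb.se/dark-3675680/bib13991099_18641223_0_1_0002.jp2"},
]]

rows = flatten_issues(sample_issue_list)
csv_text = rows_to_csv(rows, ["dark_id", "title", "fileurl"])
print(csv_text)
```

Because every row repeats `dark_id` and `title`, the issue a file belongs to can still be recovered by grouping on `dark_id`.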
