# Algorithmic toolboxes for the study of the filmic past—on newsreels & AI

### Robert Aspenskog [![orcid](https://orcid.org/sites/default/files/images/orcid_16x16.png)](https://orcid.org/0009-0005-4720-3352) 

Department of Cultural Sciences, Linnaeus University 

Department of Cultural Sciences, Lund University


### Mathias Johansson [![orcid](https://orcid.org/sites/default/files/images/orcid_16x16.png)](https://orcid.org/0000-0002-3338-0551) 
Department of Arts and Cultural Sciences, Lund University

### Johan Malmstedt [![orcid](https://orcid.org/sites/default/files/images/orcid_16x16.png)](https://orcid.org/0000-0001-5207-4296) 

GRIDH, University of Gothenburg 

metaLAB, Harvard University

### Emil Stjernholm [![orcid](https://orcid.org/sites/default/files/images/orcid_16x16.png)](https://orcid.org/0000-0001-9871-5191) 
Department of Communication, Lund University

### Pelle Snickars [![orcid](https://orcid.org/sites/default/files/images/orcid_16x16.png)](https://orcid.org/0000-0001-5122-1549) 
Department of Arts and Cultural Sciences, Lund University

[![cc-by](https://licensebuttons.net/l/by/4.0/88x31.png)](https://creativecommons.org/licenses/by/4.0/) 
©<AUTHOR or ORGANIZATION / FUNDER>. Published by De Gruyter in cooperation with the University of Luxembourg Centre for Contemporary and Digital History. This is an Open Access article distributed under the terms of the [Creative Commons Attribution License CC-BY](https://creativecommons.org/licenses/by/4.0/)


In [None]:
from IPython.display import Image, display

metadata={
    "jdh": {
        "module": "object",
        "tags": ["figure-Gunnar-*"],
        "object": {
            "type":"image",
            "source": [
                "Gunnar Skoglund (1899–1983)—once hailed in the Swedish press "
                "as a “speaking artist”—did hundreds of voice-overs for the "
                "national SF newsreel. At the time, his voice was arguably "
                "the most familiar in Sweden. Skoglund was also a skilled "
                "director, actor and producer of some thirty short films, "
                "dating from the late 1920s to the 1950s."
            ]
        }
    }
}
display(Image("./media/img1.png", width=1000), metadata=metadata)

 (optional) This article was originally published (...)

FirstKeyword, SecondKeyword, AlwaysSeparatedByAComma

Using a computational film studies framework, this article examines a major
Swedish newsreel archive—the Journal Digital collection—deploying signal
archaeology, named entity recognition, and geocoding. We also apply a specific
algorithmic toolbox to text extraction of intertitles, as well as a tweaked
tool, SweScribe, based on automatic speech recognition, in order to both
transcribe and timestamp speech from the collection’s sound films. Our basic
idea is to construct a number of mid-sized datasets from the Journal Digital
collection in different modalities, and proceed with an examination using
various approaches. Consequently, our intention is to increase the scholarly
capacity of media historical sources, while at the same time critically
scrutinizing AI and algorithmic toolboxes for the study of the past.

## Introduction

At the archive of Swedish Television (SVT) and Radio Sweden (SR) there are some
thirty volumes of so-called speaker lists used for voice-over in the production
of the national SF newsreel—Svensk Filmindustris Veckorevy (Svensk
Filmindustri’s Weekly Review). During the 1960s, the film company Svensk
Filmindustri (SF) sold its non-fiction film and newsreel archive to public
service radio and tv. The making of newsreels, however, had begun already in
1914, when SF started producing a weekly newsreel in a similar fashion and
format as in other countries. These newsreels were nationally distributed in
dozens of copies across Sweden; during the silent era they naturally included
textual intertitles. From 1932 onwards sound was added; the preserved
voice-over scripts in the SVT and SR vaults hence range from 1932 until 1959.
These lists were as detailed as they were plentiful; today each volume in the
archive includes some 150 manuscripts, bringing the total number of speaker
lists to approximately 5,000. In a literal sense they were production
manuscripts; almost all contain handwritten edits and brief comments. In all
likelihood they served as the final manuscript for the person who did the
voice-over commentary in the film studio, specifying the length in seconds of
each spoken passage, while also indicating what type of shot the edited
newsreel depicted during each particular sequence.

In [None]:
from IPython.display import Image, display

metadata={
    "jdh": {
        "module": "object",
        "object": {
            "tags": ["figure-first-newsreel-*"],
            "type":"image",
            "source": [
                "The first episode in an SF-newsreel from March 1933 depicted "
                "the Swedish frigate Vanadis, a naval training ship. "
                "The commentary was read by Gunnar Skoglund, and as is "
                "evident from the preserved speaker list, it was meticulous, "
                "customized in seconds, and almost identical to what was "
                "heard in the film. In English translation Skoglund rapidly "
                "announced: “Since the beginning of the 20th century, the "
                "venerable frigate Vanadis has been anchored at Skeppsholmen "
                "in Stockholm as a lodging and training ship. "
                "In its disrigged hull, two generations of sailors have slept "
                "the warrior’s heavy and well-deserved sleep in hammocks. "
                "Lying in a bunk it is called in sailor’s language, but it is "
                "not quite the same as sleeping in a bed at home. "
                "If nothing else, the alarm clock is a little louder here”— "
                "trumpet immediately!"
            ]
        }
    }
}
display(Image("./media/img2.png", width=1000), metadata=metadata)

For digital media historians, trying to analyse speaker lists from the SF
archive as a textual dataset poses challenges. If one, for example, uploads an
archival sample of a speaker list to an AI-powered engine such as Perplexity,
with the prompt to OCR a particular section, the result is not particularly
impressive. The AI model does a fair job with all typed Swedish text, but fails
with handwritten notes as well as words that have been manually crossed out.
Occasionally, Swedish words and letters even turn into Cyrillic script: “8 sek.
Strömmen. Alltsedan mitte av 1890-talet karxxіях кирракX Momenxxxxsixxxkxxmxx”.
Working with the SF speaker lists in digitised form is hence difficult for
scholars. These lists exhibit many traits common to archival documents: the
combination of typed and handwritten text with corrections often hinders both
OCR and the correct structuring of data.
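A script mix-up of this kind can at least be flagged automatically before any further processing. The following is a minimal sketch, not part of the toolbox used in this article: the function names `script_mix` and `has_unexpected_script` are hypothetical, and the approach assumes that the first word of a character's Unicode name (for example LATIN or CYRILLIC) is a good enough proxy for its script.

```python
import unicodedata


def script_mix(text: str) -> dict[str, int]:
    """Count alphabetic characters per script, using the Unicode name prefix."""
    counts: dict[str, int] = {}
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation and whitespace
        # e.g. "LATIN SMALL LETTER O WITH DIAERESIS" -> "LATIN"
        script = unicodedata.name(ch, "UNKNOWN").split()[0]
        counts[script] = counts.get(script, 0) + 1
    return counts


def has_unexpected_script(text: str, expected: str = "LATIN") -> bool:
    """Return True if any letter falls outside the expected script."""
    return any(script != expected for script in script_mix(text))


# Hypothetical usage on OCR output lines (assumed to be Swedish, hence Latin script)
ocr_lines = ["8 sek. Strömmen.", "Alltsedan mitte av 1890-talet кирракX"]
for line in ocr_lines:
    print(has_unexpected_script(line), script_mix(line))
```

Such a check only detects that something went wrong; it cannot recover the crossed-out or handwritten text that caused the confusion in the first place.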

AI hallucination of a Swedish voice-over into Cyrillic can of course also be
seen as a token of how large language models generate information that is
historically inaccurate—and even fabricated. By now it is well known among
historians that LLMs often hallucinate about the past, particularly when more
specific questions are posed. While it is true that generative AI does give apt
textual answers, such models have repeatedly been critiqued when it comes to
producing historical images. Fabian Offert has stressed that models such as
CLIP or DALL-E find themselves in “a triple bind: they suffer from syntactic
invariability in the case of generally historical prompts, semantic arbitrarity
in the case of specifically historical prompts, and superficial, corporate
censorship that affects both” <cite id="1pjw9"><a href="#zotero%7C22783102%2FUK5SM8C2">(Offert, 2023)</a></cite>. The latter is,
of course, particularly problematic. Nearly all forms of generative AI are
circumscribed by commercial constraints; encoded values are neither epistemic
nor scholarly.

Despite these AI shortcomings regarding the past, in this article we aim to
analyse the SF-archive—today usually referred to as the Journal Digital
collection—with a diverse set of algorithmic tools. Given our initial
discussion, we have however refrained from digitising textual sources (such as
the preserved SF speaker lists), and instead focused on the audiovisual
material per se. Using a computational film studies framework <cite
id="568sd"><a href="#zotero%7C22783102%2FH4XMMWYS">(Oiva et al.,
2024)</a></cite>, this article hence examines the Journal Digital collection
deploying signal archaeology, named entity recognition and geocoding. We
will also apply a specific algorithmic toolbox, developed for this article, to
text extraction of intertitles from silent newsreels—the application is called
stum—as well as using a tweaked tool, SweScribe, based on automatic speech
recognition, in order to both transcribe and timestamp speech from the
collection’s sound films.

## Start of placeholders.
TODO: Remove

In [None]:
from IPython.display import Image, display

display(Image("./media/placeholder.png"))


This is a hermeneutic paragraph

Editor|1641|1798|1916
---|---|---|---
Senan|0.55|0.4|0.3
Henry|0.71|0.5|0.63

In [None]:
# Check your Python version
from platform import python_version
python_version()

#!python -V

In [None]:
# Third-party packages (pandas etc.) need to be added to requirements.txt
# Standard library (itertools.pairwise requires Python 3.10+)
import os
import threading
import time
import zipfile
from itertools import pairwise
from pathlib import Path

# Third-party
import cv2
import ipywidgets as widgets
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
from IPython.display import display
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud

# TODO: check versions, add to requirements.txt
root = Path('.')
data = root / 'script'

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/lux-org/lux-datasets/master/data/college.csv")
df

## End of placeholders.
TODO: Remove

In [None]:
# Load the bare minimum.


Hermeneutics header about video quality.
... showing resolution and dpi from a frame and the previous image?



In [None]:
from IPython.display import Video, display

metadata={
    "jdh": {
        "module": "object",
        "tags": ["video-composition-*",],
        "object": {
            "type":"video",
            "source": [
                "It is indeed difficult to visualise a dataset with thousands "
                "of nonfiction films; still, a small quick-and-dirty "
                "compilation of 28 silent films (from the 1920s) gives an "
                "impression of the vivid depiction of Swedish society—and "
                "modernity—that the SF-archive encompasses."
            ]
        }
    }
}
display(Video("./media/vid1.mp4", width=1000), metadata=metadata)

Our work is rooted in the research project Modern Times 1936, which explores
what software sees, hears and perceives when technologies for pattern
recognition are applied to media historical sources. Within this research
project we have previously been interested in the historical gaze of generative
AI <cite id="l41c8"><a href="#zotero%7C22783102%2F7HB6GAHN">(Stjernholm et al., 2025)</a></cite>, algorithmic scaling of early cinema on YouTube <cite
id="guc1d"><a href="#zotero%7C22783102%2FTV37XT5P">(Stjernholm &#38; Snickars,
2024)</a></cite>, and techniques for assessing photorealism in synthetic images
<cite id="v6zpe"><a href="#zotero%7C22783102%2FVL4UHIWF">(Eriksson,
2024)</a></cite>. If Charlie Chaplin once in Modern Times (1936) struggled to
comprehend an industrialised world with giant machines, a common denominator in
our research project has been to explore how computational methods can help us
understand modernity in new ways. In this article, the idea is hence to
construct a number of mid-sized datasets from the Journal Digital collection in
different modalities, and proceed with an examination using various
computational approaches. Consequently, our intention is to increase the
scholarly capacity of media historical sources, while at the same time
critically scrutinizing AI and algorithmic toolboxes for the study of the past.

In [None]:
from IPython.display import Image, display

metadata={
    "jdh": {
        "module": "object",
        "object": {
            "tags": ["figure-SF-facilities-*"],
            "type":"image",
            "source": [
                "The Swedish SF film company had its production facilities "
                "with studio and film laboratory located in the so-called "
                "Film-City (Filmstaden) in the suburb of Råsunda "
                "(north of Stockholm), from 1920 until 1969. As is evident "
                "from these late 1920s and 1930s photographs, the production "
                "of newsreels was a practical craft; it involved editing, "
                "cleaning, and copying film, as well as synchronising added "
                "sound. The film itself was nitrate-based celluloid, known "
                "both for its high image quality—and dangerous flammability. "
                "Illustrations from the Swedish Film Institute and the "
                "Swedish National Museum of Science and Technology. "
            ]
        }
    }
}
display(Image("./media/img3.png", width=600), metadata=metadata)

## Biography of a dataset

During the late 1950s, the film manuscript writer Gardar Sahlberg (1908–1983)
started to take an increased interest in the film archive at Svensk
Filmindustri (SF). At the time, however, the company archive of nitrate films,
such as newsreels and short films, was in dire need of restoration (and
preservation). Most of the oldest footage was filmed by cinematographers from
Swedish Biograph, a company that in 1919 merged into SF <cite id="vw0o9"><a
href="#zotero%7C22783102%2F6APYGVMY">(Olsson, 2022)</a></cite>. From the
beginning of the 1920s until the 1960s, SF had been the leading producer of
newsreels, educational cinema, and other types of useful films, distributed in
both theatrical and non-theatrical venues <cite id="2paaf"><a
href="#zotero%7C22783102%2FBQLCIUEW">(Stjernholm &#38; Florin Persson,
2019)</a></cite>. As a way to finance a reconstruction and improvement of the
SF archive, Sahlberg and SF decided to produce historical compilation films
based on the same old film material. In 1961, for example, the documentary När
seklet var ungt (When the century was young) had its premiere. Critics endorsed
the film—but audiences did not. Instead, SF started negotiations with Radio
Sweden (SR), which at the time was also responsible for national public service
television. During the winter of 1964, it was announced that SR had acquired
the whole newsreel archive from SF; one million meters of film dating from 1897
to 1960 was purchased <cite id="09k1h"><a
href="#zotero%7C22783102%2FBNUCIK7T">(Snickars, 2024)</a></cite>. The deal was
explosive, not only because of the number of preserved nitrate prints; the
SF-archive would now find a novel audience in another medium: television.

In [None]:
from IPython.display import Image, display

metadata={
    "jdh": {
        "module": "object",
        "object": {
            "tags": ["figure-seklet-*"],
            "type":"image",
            "source": [
                "The compilation film, När seklet var ungt "
                "(When the century was young) from 1961, was "
                "based on the oldest nonfiction films and "
                "newsreels within the SF-archive. In February "
                "1962, the director Gardar Sahlberg stated in "
                "a letter to the national archivist in Sweden, "
                "that the intention of the film was to "
                "show “the qualitatively impressive archival film material” "
                "to a culturally interested audience. "
                "At SF we had the belief, he admitted, "
                "that curiosity of this type of older films "
                "would be so great “that the money received "
                "from the box office would make it possible "
                "to go ahead, and finance the renewal of "
                "the [SF-]archive”. But unfortunately that "
                "was not the case—despite excellent reviews. "
                "When the film was shown "
                "“in a few places in the countryside, there was no audience,” "
               'he sadly concluded '
               '<cite id="8l0z1"><a href="#zotero%7C22783102%2FPECABWGW">(Sahlberg, 1962)</a></cite>'
            ]
        }
    }
}


display(Image("./media/img4.png", width=500), metadata=metadata)

As part of the deal, Sahlberg and a few colleagues at SF were hired by SR to
work with safeguarding the SF-archive. During the 1960s and 1970s, the archive
was preserved, duplicated, catalogued, and subsequently re-used in numerous
television programs. In fact, SR made a remarkable cultural-historical effort
in saving this film archive. From the old nitrate prints, Sahlberg had three
different film materials made: a master copy on 35 mm (for preservation), a
duplicate negative on 35 mm to obtain new copies, and a 16 mm display copy for
program producers at SR. Sahlberg also took personal responsibility to
catalogue the entire SF-archive, manual work he basically did on his own.
Notably, all films from SF were catalogued under a specific SF-number; SF2001
for example was the oldest film in the archive, dating from 1897. Since SR took
excellent care of these films, the company was able to acquire other film
collections as well. Some 400 films were purchased from Kinocentralen in the
mid 1960s, a company that had produced short and industrial films from the
early 1920s. Other similar films were donated to SR from, for example, the
Swedish Film Library (Filmhistoriska Samlingarna), the Salvation Army Sweden,
the Stockholm City Museum, and the film archive at the Swedish State Railways
(SJ)—the latter collection contained almost two hundred nonfiction films
produced by SJ between 1920 and 1960. SR also bought the newsreel Nuet (Now)
produced by the film company Nordisk Tonefilm during a few years in the mid
1950s <cite id="nx09j"><a href="#zotero%7C22783102%2FDSU6R48K">(Asp,
2014)</a></cite>. Finally, in 1969 SR also decided to acquire all short films
produced by SF, a deal that made the entire collection at SR amount to more
than five thousand films. As a consequence, the initial SF-archive came to
contain a range of different types of documentary film, with different
provenances. It should be noted, however, that the rationale behind all these
film acquisitions was the potential usage of old films in new tv-programs.
Still, SR was also interested in developing a sales organisation for
tv-programs, with film rights as a prospective revenue stream. An interesting
aspect of the initial purchase of the SF-archive hence concerned what type of
rights (to old footage) SF actually sold to SR, since the archive also
contained films of foreign origin (such as Pathé and Gaumont). In a memo from
1964, the head of the film archive at SR therefore urged a certain degree of
caution when it came to the reuse of foreign films <cite id="ziu8k"><a
href="#zotero%7C22783102%2FDVKUX2N4">(Norrlander, 1964)</a></cite>.

In [None]:
from IPython.display import Video, display

metadata={
    "jdh": {
        "module": "object",
        "tags": ["video-composition2-*",],
        "object": {
            "type":"video",
            "source": [
                "The film collection at SR grew steadily during the 1960s—and "
                "it was indeed heterogeneous. The collection included early "
                "cinema, silent newsreels from Swedish Biograph, Svensk "
                "Filmindustris Veckorevy, films from the Swedish State "
                "Railways (SJ) and their film archive, short films from "
                "Kinocentralen, and the 1950s newsreel Nuet, produced by "
                "Nordisk Tonefilm."
            ]
        }
    }
}
display(Video("./media/vid2.mp4", width=500), metadata=metadata)

From the mid 1960s, Swedish public service television appropriated the films
that Sahlberg and his colleagues had preserved and catalogued. Footage from
the SF-archive and the other film collections at SR was reused in thousands of
tv programs <cite id="ujcdl"><a href="#zotero%7C22783102%2F24HEASLG">(Eriksson
et al., 2024)</a></cite>. In many ways these moving images shaped the ways that
Swedes perceived their past <cite id="7cqtj"><a
href="#zotero%7C22783102%2F2RDITF4M">(Eriksson et al., 2022)</a></cite>. In the
1980s, the first steps towards digitising the catalogue of films and metadata
were taken. Interestingly, the motivation for both microfilming the catalogue
and developing a rudimentary database of information about the content of old
newsreels was financial. When the new head of the TV archive (as it was now
called), Birgitta Lagnell, was interviewed in the mid 1980s, her major quest
was how to monetise the film archive: “She will make the gold mine of
television profitable”, headlines stated <cite id="bsd9e"><a
href="#zotero%7C22783102%2FDW4S37Y5">(Bergman, 1986)</a></cite>.

Ten years later, an increased academic interest in the SF-archive resulted in
an externally funded research project with the aim of making the old film
collections more accessible. As a result, the department of cinema studies at
Stockholm University started a collaboration with the TV archive, granting
access to scholars and PhD students interested in the SF-archive. One of us
(Snickars) started his PhD training in cinema studies in 1995 by examining 16
mm prints from the SF-archive at a Steenbeck editing desk located in the vaults
of the TV archive. In parallel, and on the agenda at the time, Swedish public
service television began developing digital technology for terrestrial
television. A governmental report, SOU 1996:25—From mass media to multi media:
how to digitise Swedish television—laid the groundwork, and described in detail
the technical requirements. Since the digitisation of the SF-archive was
foremost media archival work, the TV archive contacted the publicly funded
Swedish National Archive of Recorded Sound and Moving Images (ALB) with a
request to scan the SF-archive to video—with the prospect of later transferring
content to digital tape <cite id="wfa1f"><a
href="#zotero%7C22783102%2F928EZRZ4">(ALB, 1996)</a></cite>.

During the summer of 1997, ALB started scanning; the deal was to transfer five
hundred nonfiction films every year <cite id="ivltg"><a
href="#zotero%7C22783102%2FR86YMEXD">(ALB, 1997)</a></cite>. Additional
external funding was secured, and ALB also decided to transfer all catalogue
information to machine-readable formats. ALB had been inaugurated in the late
1970s, following an extension of the Swedish legal deposit law to include
audiovisual material as well. ALB was in many ways a video archive in the
service of academic research; public service broadcasts were stored on magnetic
tape. Yet, in the mid 1990s it became all too apparent that video and tape
recordings would not sustain content for longer periods of time. In digital
format, however, it was likely that the same content could be preserved—for
ALB, the digitisation of the SF-archive hence developed into a case study of
how to proceed with such a major technical and media archival transition. In a
description (for an application) from 1998, The digital newsreel archive, it
was stated that ALB was now planning “to digitise all scanned video tapes
\[from the SF-archive\]. The digitised recordings will then be stored \[on\]
discs in an automated archive”. Converted catalog information would also be
linked to each film. “This will allow the user to perform catalog searches, and
also view the requested film directly online on a computer, which would
effectively streamline research usage” <cite id="j8r7k"><a
href="#zotero%7C22783102%2FM63TBZ7V">(ALB, 1998)</a></cite>.

In [None]:
from IPython.display import Image, display

metadata={
    "jdh": {
        "module": "object",
        "object": {
            "tags": ["figure-ALB-*"],
            "type":"image",
            "source": [
                "In 2003, the Swedish National Archive of Recorded Sound and "
                "Moving Images (formerly ALB) launched Journal Digital—a "
                "collection with more than 5,000 digitised films and linked "
                "metadata, accessible through a user-friendly interface. "
                "At the same time, one percent of the films from Journal "
                "Digital were also made available online on the web, a site "
                "that rapidly became popular among (foremost elderly) Swedes. "
                "Today, twenty years later, at the portal filmarkivet.se—a "
                "joint venture between the National Library of Sweden and the "
                "Swedish Film Institute—some three hundred SF-newsreels "
                "are openly available for anyone to watch."
            ]
        }
    }
}


display(Image("./media/img5.png", width=800), metadata=metadata)

The digitisation work proceeded in the coming years—albeit slowly. At a board
meeting in late 1999, the head of ALB, Sven Allerstrand, had to confess that
“unfortunately, as resources are lacking for an R&D function within ALB,
development work \[with the SF-archive\] is progressing very slowly. Resources
must be taken from ordinary operations, which has a negative impact on the
overall result.” However, Allerstrand stated, at our media archive we are still
“convinced that the only possible solution to ensure the long-term preservation
of ALB’s material is automated processing in a digital mass storage system”
<cite id="nnebq"><a href="#zotero%7C22783102%2FBXDUCBBT">(ALB,
1999)</a></cite>. A year later, ALB finally secured funding from the
government, and the transfer of the SF-archive into digital format proceeded
at a more rapid pace. In 2002, all newsreels and short films had finally been
digitised. However, since the film collection included not only films from
SF—but also films from Kinocentralen, the Swedish State Railways, and newsreels
from Nordisk Tonefilm—ALB decided to change the name of the digitised
collection to Journal Digital. A new search interface for the collection,
publicly accessible on computers at ALB, was developed, as well as a web site
with one percent of the collection online (through permission from SVT).
Journal Digital gave access to 4,348 newsreels and short films from Svensk
Filmindustri, 421 nonfiction films from Kinocentralen, 267 Nuet-newsreels from
Nordisk Tonefilm, and 170 SJ-documentary films from the Swedish State
Railways—in all 5,206 films dating from 1896 to the mid 1960s. Consequently,
that is the number of films that our dataset—the Journal Digital collection—is
based upon. It should be stressed, however, that in the following we will often
write about the analyses of newsreels, since they make up the major part of our
dataset. Yet, we are aware that our examined film material also contains other
genres.
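As a quick sanity check, the per-collection counts given above can be tallied and expressed as shares of the whole. The sketch below only restates the figures from the text; the dictionary labels and variable names are our own shorthand for illustration, not identifiers from the Journal Digital catalogue.

```python
# Film counts per sub-collection, as stated in the text above
collection_counts = {
    "Svensk Filmindustri (newsreels and short films)": 4348,
    "Kinocentralen (nonfiction films)": 421,
    "Nordisk Tonefilm (Nuet newsreels)": 267,
    "Swedish State Railways (SJ documentary films)": 170,
}

total_films = sum(collection_counts.values())

# Share of each sub-collection in percent, rounded to one decimal
collection_shares = {
    name: round(100 * count / total_films, 1)
    for name, count in collection_counts.items()
}

print(total_films)
for name, share in collection_shares.items():
    print(f"{name}: {share} %")
```

The tally confirms that material from Svensk Filmindustri dominates the dataset, which is why the analyses below mostly speak of newsreels.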


In [None]:
from IPython.display import Image, display

metadata={
    "jdh": {
        "module": "object",
        "object": {
            "tags": ["figure-box-*"],
            "type":"image",
            "source": [
                "Via the content management system Box, Swedish researchers "
                "can today remotely access the audiovisual collections at the "
                "National Library of Sweden. While sound recordings are fine "
                "and acceptable, this is hardly the case for film or "
                "television content—which is displayed in low resolution "
                "formats, 704 x 576 pixels, in a small square on the screen "
                "measuring nine times seven centimetres. "
            ]
        }
    }
}


display(Image("./media/img6.png", width=800), metadata=metadata)

Formats tend to stick, and remain the same: when the Swedish National Archive
of Recorded Sound and Moving Images (ALB) digitised the SF-archive in the year
2000, all films were converted into MPEG-2 format, a digital video and audio
compression standard developed during the 1990s by the Moving Picture Experts
Group (MPEG). The MPEG-2 versions were preserved on digital tape at ALB. From
these preservation files, another video converter compressed films into a
browse copy in MPEG-1 format, a lossy compression format with files stored on a
hard drive for instant access via an interface. 25 years ago, MPEG-1 was a
standard format for compressed video online. Yet, this is obviously not the
case any more. If film archivist Gardar Sahlberg during the 1960s made sure to
preserve the SF-archive on a master copy on 35 mm, a duplicate negative on 35
mm, and a 16 mm display copy, this is far from how researchers today are
confronted with the SF-archive in the digital domain. Decisions taken back then
still linger; researchers working with audiovisual materials today must still
be satisfied with late 1990s low resolution copies.
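To give a rough sense of scale, the frame size and window dimensions mentioned in the Box figure caption above (704 x 576 pixels, shown at roughly nine by seven centimetres) can be turned into a pixel count and an on-screen pixel density. This is a back-of-the-envelope sketch: the helper names are hypothetical, and the comparison with a 1920 x 1080 frame is our own assumption, chosen only as a familiar present-day reference point.

```python
CM_PER_INCH = 2.54


def pixel_count(width: int, height: int) -> int:
    """Total number of pixels in a frame."""
    return width * height


def pixels_per_inch(pixels: int, centimetres: float) -> float:
    """On-screen pixel density along one axis."""
    return pixels / (centimetres / CM_PER_INCH)


box_frame = (704, 576)         # frame size stated in the figure caption
reference_frame = (1920, 1080) # assumption: Full HD as a modern reference

box_pixels = pixel_count(*box_frame)
ratio = pixel_count(*reference_frame) / box_pixels
horizontal_ppi = pixels_per_inch(box_frame[0], 9)  # window is about 9 cm wide

print(box_pixels, round(ratio, 2), round(horizontal_ppi, 1))
```

Under these assumptions, the reference frame holds roughly five times as many pixels as the copies researchers can access remotely.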

In [None]:
from IPython.display import Video, display

metadata={
    "jdh": {
        "module": "object",
        "tags": ["video-filmarkivet-*",],
        "object": {
            "type":"video",
            "source": [
                "Proper filmic heritage is naturally dependent on digital "
                "quality; both sequences display the Stockholm exhibition in "
                "1897, shot by a Lumière cinematographer (SF2001). "
                "The version to the right is low resolution with pixels "
                "clearly visible if the frame is enlarged—the speed is also "
                "not adjusted. Sadly, this is the version that academic "
                "researchers are confronted with in the Box interface at the "
                "National Library of Sweden. In the version to the "
                "left—visible in the public interface filmarkivet.se—speed is "
                "adjusted, and the sequence is displayed in MPEG-2, a still "
                "acceptable digital resolution."
            ]
        }
    }
}
display(Video("./media/vid2.mp4", width=500), metadata=metadata)

Decisions made two decades ago still hold sway. Consequently, in this article
we have been working with low resolution (MPEG-1) copies from Journal Digital.
As stated, the National Library of Sweden and the Swedish Film Institute also
give online access to some three hundred SF-newsreels in high-resolution
MPEG-2 versions at the portal filmarkivet.se—thus, occasionally we will
illustrate our arguments with these film versions as well. Then again, there is
a sharp contrast between the highly curated selection of restored
high-resolution newsreels available through filmarkivet.se, and the low-quality
digitisation of a vast majority of the SF newsreel archive. In addition,
filmarkivet.se provides limited metadata, little contextualization using film
historical sources, and no possibility to analyze the content at hand as data,
effectively limiting the scholarly utility of the site <cite id="zmeiv"><a
href="#zotero%7C22783102%2F583554VC">(Snickars 2015: 65)</a></cite>.

In [None]:
# Load audio stuff

## Signal archaeology of the audiovisual past

Annotating content within 5,205 (TODO) nonfiction films is already a demanding
task, but doing so with precision across both the sonic and visual domains
requires even greater care and effort. Previous research has emphasized the
time-consuming nature of audiovisual annotation <cite id="u4gve"><a
href="#zotero%7C22783102%2FWHSQVQAB">(Guyot et al., 2019)</a></cite>.
Nevertheless, to fully understand how these films are structured, both audio
and visuals must be taken into consideration. Not only do image and sound play
crucial roles within newsreels—naturally, after the introduction of sound in
the early 1930s—but the genre also provides a valuable window into the
historical formation of specific ways of juxtaposing these modalities within
the Journal Digital collection.

Importantly, approaching this film collection through an archaeology of the
audiovisual signal invites a shift in perspective: from meaning to trace, from
representation to inscription <cite id="9ahou"><a
href="#zotero%7C22783102%2FK4N8PIGZ">(Malmstedt, 2025)</a></cite>. Each film
can be understood as a layered field of signals that bears the marks of its
technological circumstances, institutional routines, and cultural habits. Sound
and image are, then, regarded less as vehicles of meaning than as
historical artefacts that carry within their textures and fluctuations the
sedimented practices of production and aesthetics. Accordingly, the initial
experiments conducted on this dataset account for both sound and image,
complementing human interpretation with automated annotations. This approach
enables the generation of first-level segmentations for each modality, allowing
for both separate and integrated navigation within large, untagged archival
collections. It also establishes a foundation for comparative analysis of the
relationship between the visual and auditory modalities, to be explored in the
following.

To analyse this aspect of audiovisual communication, we employ computational
methods that can register both the visual and sonic dimensions of newsreels.
Specifically, we use two transformer-based models: Moondream, a lightweight
vision-language model for object detection and other vision tasks (Korrapati
2025), and an AST (Audio Spectrogram Transformer, Gong et al. 2021). Each model
processes its respective sensory channel independently, yet their embeddings
can be aligned to enable cross-modal comparison. To illustrate this workflow,
the following film segment has been annotated in both the audio and visual
domains. Both models were used in their publicly available, pre-trained forms,
which have demonstrated high general accuracy across a wide range of
audiovisual tasks. Moondream in particular is designed for image understanding
and captioning, enabling efficient object detection and scene parsing even on
modest computational resources. However, we made several adjustments to tailor
the models’ performance to the
specific requirements of historical newsreel material. For the visual analysis,
we limited the number of detected objects in each frame to emphasize overall
compositional and semantic features rather than exhaustive enumeration. Without
this constraint, the model tended to generate redundant detections, for
instance, identifying every instance of a person separately, which led to
overly cluttered outputs. In the audio domain, we refined the results by
filtering out misleading detections, such as static noise being misclassified
as environmental sounds like rain. These calibrations allowed us to focus on
the broader audiovisual texture of the material rather than on granular or
noisy details.
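
The two calibrations described above can be sketched roughly as follows. This
is a minimal illustration under stated assumptions, not the code used for the
article: the function names, the cap of five labels per frame, the score
threshold, and the blocklist containing `rain` are all hypothetical values
chosen for the example.

```python
from collections import Counter

# Hypothetical settings, chosen for illustration only
MAX_OBJECTS_PER_FRAME = 5
MIN_AUDIO_SCORE = 0.3
NOISE_PRONE_AUDIO_LABELS = {"rain", "static"}


def cap_visual_labels(labels, max_objects=MAX_OBJECTS_PER_FRAME):
    """Collapse repeated detections and keep the most frequent labels.

    Ten separate "person" detections become a single "person" entry, so
    the frame is described by its composition rather than by a head count.
    """
    counts = Counter(
        label.strip().lower() for label in labels if label.strip()
    )
    return [label for label, _ in counts.most_common(max_objects)]


def filter_audio_labels(
    scores,
    blocked=NOISE_PRONE_AUDIO_LABELS,
    min_score=MIN_AUDIO_SCORE,
):
    """Drop weak detections and labels known to be triggered by noise.

    scores: mapping from audio label to a probability-like score.
    """
    return {
        label: score
        for label, score in scores.items()
        if score >= min_score and label.lower() not in blocked
    }
```

Keeping both steps as small, separate functions makes it easy to rerun the
analysis with a different cap or blocklist and to compare how sensitive the
results are to these choices.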

## Traces of sonic experiments 

To understand how content is distributed, the first thing to consider is the
nature of the dataset itself. The SF-archive, and the subsequent Journal Digital
collection, varies widely in scope and condition; the latter has for decades
been conditioned by differences in film production, and later by dissimilar and
uneven processes of preservation. Some nitrate prints have hence only survived
in partial form, and (after the introduction of sound) many films still lack a
complete soundtrack, usually because the audio has deteriorated.

In [None]:
## Figure
audio_files_year = pd.read_csv(data / 'year_audio_files.tsv', sep='\t').set_index('Year')

In [None]:
from IPython.display import Image, display

metadata={
    "jdh": {
        "module": "object",
        "object": {
            "tags": ["figure-audio_bars-*"],
            "type":"image",
            "source": [
                "The graph shows the overall distribution of audiovisual films "
                "in the Journal Digital collection, with the emphasis on "
                "digital files with sound. As is evident, the distribution of "
                "files is uneven, with a gradual increase during the 1930s."
            ]
        }
    }
}

def get_audio_bars():
    """Bar chart of the number of files with usable sound per year."""
    fig = px.bar(
        audio_files_year,
        y='audio',
        title='Files with Useful Sound per Year (≥1930)',
    )
    fig.update_layout(
        xaxis_title="Year",
        yaxis_title="# of files with useful audio",
        xaxis=dict(tickmode='linear', tick0=1930, dtick=10),
        yaxis=dict(tickmode='linear', tick0=0, dtick=10),
    )
    return fig


display(get_audio_bars(), metadata=metadata)


Tracing and tracking sound, the pattern above is expected: early in the 1930s,
sound production of newsreels was limited by technical constraints and
substantial costs. However, the data also offers a revealing media-historical
insight into the gradual incorporation of sound into the newsreel format.
Whereas the arrival of sound in cinema is often portrayed as a sudden and
decisive transformation, usually pinned to the musical talkie The Jazz Singer
(1927), featuring six songs performed by Al Jolson, evidence from the Journal
Digital collection indicates a more gradual and somewhat uneven process of
sound integration <cite id="kr6es"><a
href="#zotero%7C22783102%2FTEPVQNE3">(Beck, 2011)</a></cite>. Furthermore, the
second half of the graph above shows a gradual decline. At first glance, this
might seem surprising, but it is more likely a reflection of the structure of
the film collection than of an actual historical trend. If one instead plots
the proportion of films containing sound for each year, a more consistent
pattern of integration appears, evident in the graph below. It also provides a
clearer foundation for our following analysis, where we examine how image and
sound interact across different phases of newsreel development.

In [None]:
audio_files_year['share'] = audio_files_year['audio'] / audio_files_year['files']

In [None]:
from IPython.display import Image, display

metadata={
    "jdh": {
        "module": "object",
        "object": {
            "tags": ["figure-audio_shares-*"],
            "type":"image",
            "source": [
                "Departing from newsreels with sound in the Journal Digital "
                "Corpus, the graph shows the share of digital film files with "
                "sound from 1930 until 1966. The drop in 1963 is the result of "
                "SF-newsreel production coming to a complete end."
            ]
        }
    }
}

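def get_audio_line():
    """Line chart of the share of files with usable sound per year."""
    fig = px.line(
        audio_files_year,
        y='share',
        title='Share of Files with Useful Sound per Year (≥1930)',
    )
    fig.update_layout(
        xaxis_title="Year",
        # 'share' is a proportion between 0 and 1, not a file count
        yaxis_title="Share of files with useful audio",
        xaxis=dict(tickmode='linear', tick0=1930, dtick=10),
        # Tick every 0.1; a step of 10 would hide all ticks on a 0-1 axis
        yaxis=dict(tickmode='linear', tick0=0, dtick=0.1),
    )
    return fig


display(get_audio_line(), metadata=metadata)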

One observation is that the graph makes the process of audiovisual
integration in newsreel production appear almost linear, a steady movement
toward the normalization of sound. But it is also striking that silent films
continued to appear well into the 1940s. The data therefore resist any neat
periodisation. What can be seen instead is a slow and uneven adoption that
depends as much on institutional practice and local conditions as on
technological possibilities. It is also a reminder that patterns one might
observe later are shaped by underlying imbalances in the film archive itself.
But the gradual nature of the sound transition is
interesting for another reason as well. It suggests that the Journal Digital
collection does not simply document the use of sound, but also the process
through which sound was being tested, adjusted, and creatively incorporated
into the newsreel format. In other words, the films capture a moment of
experimentation, starting in the early 1930s, when the vocabulary of sound in
nonfiction film production was still being invented. That moment includes a
well-documented ambivalence toward adding sound to moving images.
When sound was finally incorporated, film critics such as Béla Balázs and Rudolf
Arnheim, and filmmakers such as Sergei Eisenstein and Vsevolod Pudovkin,
lamented that it destroyed the purity of cinematic realism (<cite id="reghd"><a
href="#zotero%7C22783102%2FNEKVNJJD">Arnheim, 1957</a></cite>, <cite
id="eqhsq"><a href="#zotero%7C22783102%2FPCBNCFMD">Eisenstein et al.,
1994</a></cite>, <cite id="u2cy7"><a
href="#zotero%7C22783102%2F2IBV4AM6">Balázs, 2010</a></cite>, <cite
id="gu35x"><a href="#zotero%7C22783102%2FUVAHYVZ2">Balázs, 2017</a></cite>).
Others, by contrast, celebrated the shift as a deepening of the medium’s
evidential power.

In [None]:
# TODO Figure `28` goes here

When newsreels started to include sound during the 1930s, what could be heard
in Swedish cinemas? That is, what kinds of sounds can be detected, on a general
level, in the Journal Digital collection by way of computational audio
analysis? As is evident from the graph above, the majority of the sound
consisted of speech and music. This is perhaps not entirely unexpected, and can
be understood as an extension of the role once played by intertitles, which
previously carried much of the newsreel’s informational value—an issue we will
return to. There also appears to be a slight trend toward increased use of
music over time. All of this holds true up until the final years represented in
the dataset. However, we should be mindful of the distribution of data in this
period, where the smaller number of film files means that percentages may be
skewed (by only a few examples).

If speech and music are obvious categories that a sound analysis of the entire
Journal Digital collection would detect, the third category (Other) is of more
interest. It encompasses both sound effects and diegetic elements, as well as
how they relate to actual imagery (Chion 1994). Listing the main types of
sounds in this category immediately gives some sense of their function. A large
number of detections are, for example, labeled as bursts or explosions. They
should not, however, be taken as evidence of a particularly violent film
corpus. In many cases, these are false positives: early optical audio
recordings simply contain too much noise, and the AST model tends to interpret
such random distortions as explosions. In fact, even to the human ear, the
newsreel soundtracks are often so rough that it is difficult to tell whether a
noise belongs to a recorded scene, or if it is simply an artifact of the medium
itself. Beyond these, the data also show frequent detections of animals, cars,
and other vehicles. A fascination for hearing modernity, and bringing the
sounds of the streets to cinema audiences, was apparent in the early reception
of the sound film technology. A Stockholm film critic in 1928, upon reviewing
the emergent sound-infused newsreels (from Denmark), exclaimed: “The cars
screeched, the horses’ hooves rattled—and far off in the distance the guard
parade approached. There came the first real reminder of where sound film has
its greatest significance: in the newsreels” (Dagens Nyheter 1928). In fact,
vehicle sounds appear especially often in our dataset, accounting for around
eleven percent of all entries in the Other category. This prevalence may merit
closer attention. The sound of a vehicle in the 1930s would have carried very
different connotations than it does today <cite id="rerwk"><a
href="#zotero%7C22783102%2F6AUNIX6I">(Brownell, 1972)</a></cite>. It was less
an everyday background noise, and more a distinctive signal of modernity. “What
they heard was a new kind of sound that was the product of modern technology”
<cite id="mawwp"><a href="#zotero%7C22783102%2FURY9RMYJ">(Thompson,
2004)</a></cite>.

In [None]:
# TODO Figure `29` goes here

Another way to discover how sounds of modernity are manifested in the Journal
Digital collection is to take a closer look at how vehicle sounds are paired
with particular types of images. Interestingly, none of the detected vehicle
sounds correspond directly to images of cars. Instead, they tend to appear
alongside signs of outdoor or on-location filming: buildings, stretches of sky,
grass, and trees all coincide with the presence of vehicle sounds. To some
extent, this may be an artifact of the analysis, since the model occasionally
identifies vehicle sounds in scenes where no vehicles are visible. Yet, the
pattern is revealing, and also demonstrates a surprisingly sophisticated way of
representing sound. Rather than merely mirroring what is seen, the soundtrack
adds another layer of meaning. Vehicle sounds are often paired with other kinds
of visual information, suggesting that sound was used to evoke atmosphere,
location, or a sense of movement rather than simply adding an ordinary sound
track. This raises the question of how sound and image were actually attached,
both practically and conceptually, within newsreels.


To explore the relationship between sound and image in the Journal Digital
collection, the following graph is based on a method designed to track
recurring pairings between audiovisual elements. This method extracts
co-occurring audio and visual label pairs from films and summarises their
temporal association across years. Visual labels are read at frame times,
merged within a short window around each timestamp, and cleaned to remove
placeholders, numbers, generic words, and banned categories. Audio categories
come either from time-aligned probability scores or, when class names are
unreliable, from a stored list of top labels; both paths are converted into
simple word tokens after the same cleaning. At each timestamp, the audio and
visual tokens present form pairs that are tracked as segments with start time,
end time, and duration; very short segments are dropped, and files dominated by
silence are skipped early. Segments are then matched to a year using a
name-to-year table and aggregated to compute total seconds and the number of
distinct videos for each pair, with a minimum total duration filter. Results
are written to CSV files including all pairs, per-year summaries, and a ranked
shortlist that enforces per-audio and per-visual caps, along with a global
limit, so that no single label dominates. For the shortlisted pairs, the code
also exports per-year totals and percentages of each year’s total duration to
support plotting and later analysis.
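
The core of this procedure, turning per-timestamp labels into paired segments
and summing them per year, might look roughly like the sketch below. It is a
simplified illustration rather than the pipeline itself: the function names,
the fixed sampling step, and the two-second minimum are assumptions, and label
cleaning, silence detection, and the ranked shortlist are left out.

```python
from collections import defaultdict
from itertools import product

MIN_SEGMENT_SECONDS = 2.0  # hypothetical threshold for "very short"


def pair_segments(timeline, step=1.0, min_duration=MIN_SEGMENT_SECONDS):
    """Track audio/visual label pairs as segments with start and end times.

    timeline: list of (time_in_seconds, audio_tokens, visual_tokens),
    sorted by time and assumed to be sampled every `step` seconds.
    Returns a list of ((audio, visual), start, end) tuples.
    """
    open_pairs = {}  # pair -> start time of the segment currently running
    segments = []
    for t, audio, visual in timeline:
        current = set(product(audio, visual))
        # Close every running segment whose pair is no longer present
        for pair in list(open_pairs):
            if pair not in current:
                segments.append((pair, open_pairs.pop(pair), t))
        # Open a segment for every pair that has just appeared
        for pair in current:
            open_pairs.setdefault(pair, t)
    # Close whatever is still running after the last sample
    if timeline:
        end = timeline[-1][0] + step
        for pair, start in open_pairs.items():
            segments.append((pair, start, end))
    return [s for s in segments if s[2] - s[1] >= min_duration]


def aggregate_by_year(segments_by_file, year_of):
    """Sum seconds and count distinct files per (year, audio, visual)."""
    seconds = defaultdict(float)
    files = defaultdict(set)
    for name, segments in segments_by_file.items():
        year = year_of.get(name)  # name-to-year lookup
        if year is None:
            continue
        for (audio, visual), start, end in segments:
            key = (year, audio, visual)
            seconds[key] += end - start
            files[key].add(name)
    return {key: (seconds[key], len(files[key])) for key in seconds}
```

Storing segments rather than raw per-timestamp hits keeps a long continuous
pairing from being counted once per sample, and it makes the minimum-duration
filter a simple comparison of end and start times.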

In [None]:
# TODO Figure `30` goes here

Exploring the relationship between sound and image, some displayed correlations
point toward the emergence of a few specific patterns. A number of these are
straightforward and expected, such as artillery paired with gunfire, or
applause with crowds, or fire with scenes of gathering people. Others make
sense only partially, like the pairing of speedboats and airplanes, which
likely results from a misrecognition of similar sound textures. Then there are
combinations that are more puzzling, such as the frequent link between a bell
sound and the appearance of a flag. Overall, there seems to be a recurring
tendency for certain cross-domain associations to appear at specific moments in
time, rather than evenly across the dataset. It is also clear that the most
frequent pairings are those involving vehicle sounds. Looking at the broader temporal
graph, some co-occurrences peak quite distinctly in the early 1940s. This
cannot simply be explained by the general volume of material from that period,
since the overall number of films was then beginning to decline. Instead, it
suggests an evolving approach to diegetic strategy—one that rises and recedes
rather than progressing in a straight line. To explore this issue further, we
manually examined the most frequent pairings identified across both sound and
image domains. From these observations we compiled a list of potential diegetic
connections. A rise in the use of sound was then observed, yet it is not clear
that a distinctly diegetic style ever became dominant. Sound and image appear
instead in the Journal Digital collection to function as largely independent
channels of information. This separation was likely shaped by practical
limitations, especially the uneven quality of early audio recording.

In [None]:
# TODO Figure `31` goes here

In 1947, Stellan Dahlstedt (1910–1991) published an article about the sound
studios at Svensk Filmindustri. Sound recording on film is hardly “the easiest
procedure in filmmaking, as more than one director or producer is willing to
attest”. During recording, it was often not possible to achieve the sound
effects needed, Dahlstedt continued. In particular, it was difficult to “adjust
the strength and character of the different sounds during live recording.
Therefore, each sound is recorded separately as much as possible and mixed
during replay”. It was no secret that such a practice increased the amount of
film stock used. Still, it saved time because there was “no need to experiment
with different sound effects during filming. It is the studio time that is the
most expensive part of a film recording” <cite id="cacs6"><a
href="#zotero%7C22783102%2FFDPKQ7P8">(Dahlstedt, 1947)</a></cite>. As director
of film technology at SF, Dahlstedt knew what he was talking about, even if he
was in all likelihood writing about feature film production rather than
newsreels. Then again, his advice to avoid experimenting with sound effects
during filming has relevance for our audio analyses. On-location sound in newsreel
production seems to have been both fashionable and valued during the period,
but also technically demanding. Most sounds were added during post-production,
especially voice-over narration, but our analyses cannot really confirm the
relation between diegetic and non-diegetic sound, at least not in a scholarly
rigorous way. What is lacking within our signal archaeology set-up is the ability
to align sound features more directly with visual content, as is today done in
several other multimodal learning approaches. If newsreels did not consistently
rely on synchronized or diegetic sound, it would be misleading for
computational models to assume that a visible object should always have an
audible counterpart. One way forward would be to integrate an adjacent layer of
information—the textual. Intertitles, commentary, and other written elements
often mediated the relationship between sound and image, and their gradual
incorporation into the audiovisual mix hence deserves closer attention.

In [None]:
from IPython.display import Image, display

metadata={
    "jdh": {
        "module": "object",
        "object": {
            "tags": ["figure-blueprint-*"],
            "type":"image",
            "source": [
                "At the Swedish SF company in the Film-City, north of "
                "Stockholm, a new sound laboratory was inaugurated in the "
                "late 1940s. It included a sound central (Ljudcentralen), an "
                "echo-chamber (Ekorum) and a muting room (Dämpat rum). On the "
                "second floor there was also a mixing room, and a specific "
                "newsreel facility."
            ]
        }
    }
}


display(Image("./media/img7.png", width=800), metadata=metadata)

## Exploring newsreel intertitles

While signal archaeology renders some insight into the audio dimensions of the
Journal Digital collection, such a dataset by definition excludes early
(silent) cinema. Yet non-visual data can also be gleaned from the latter by
focusing on so-called intertitles. After 1900 they were inserted between scenes
to provide important details about location, time or setting after a scene
shift. With films growing longer, the use of intertitles also expanded. From
1910 until the advent of sound cinema, intertitles became common practice in
the film industry with its own conventions and codes. For example, intertitles
came to incorporate extended textual passages that filled the screen,
typographic differentiation to mark hierarchies between narrator and dialogue,
and ornamental as well as stylistic devices that underscored the intertitles’
aesthetic function within the cinematic experience <cite id="ndzpp"><a
href="#zotero%7C22783102%2F8P2FL3R2">(Dupré La Tour 2005: 473)</a></cite>.

When multi-reel feature films started to emerge, intertitles served a
crucial function, offering information necessary to condense the story or
provide dialogue <cite id="bzmpk"><a
href="#zotero%7C22783102%2F8BSVNXZQ">(Chaume, 2020: 106)</a></cite>. In
newsreels, meanwhile, the intertitles functioned to structure the visual
storyline, provide factual information about time, settings and portrayed
individuals, and offer commentary guiding audience reactions. With the
introduction of sound, the use of intertitles certainly changed, and the
introduction of voice-over commentary made the intertitles shorter.
Nevertheless, this audiovisual artefact, overlooked and neglected in previous
digital historical research, contains a great deal of textual information. In fact,
the Journal Digital collection contains almost fifty thousand intertitles; they
appear in more than 4,300 nonfiction films. Emphasis in this section will be
placed on contextualizing the newsreel Svensk Filmindustris Veckorevy (Svensk
Filmindustri’s Weekly Review) and the Journal Digital corpus, describing the
transcription pipeline, and exploring the collection using Voyant.

In [None]:
from IPython.display import Image, display

metadata={
    "jdh": {
        "module": "object",
        "object": {
            "tags": ["figure-military-*"],
            "type":"image",
            "source": [
                "Within the Journal Digital collection there are a number of "
                "foreign nonfiction films, such as this unidentified short "
                "film fragment from the mid 1910s, produced by Universal "
                "Screen Magazine, with more than ten intertitles in less than "
                "two minutes—albeit with short military orders!"
            ]
        }
    }
}


display(Image("./media/img8.png", width=800), metadata=metadata)

Though previous research on newsreel intertitles is limited, there is a
noticeable tension regarding how the editorial comments should be interpreted
as a historical record. Some scholars argue that newsreel intertitles were
primarily descriptive and that it was only with the invention of sound
commentary that newsreels actually became a news medium proper. Nicolas Pronay,
for example, argues that the “range of social and political information which
could be conveyed by pictures and monosyllabic captions alone, was obviously
too restricted. The change-over to sound-film ... enabled them to cover any
subject which was news-worthy irrespective of whether it was pictorial in
nature” <cite id="8lcdf"><a href="#zotero%7C22783102%2FNTI97CTV">(Pronay, 1971:
412)</a></cite>. Similarly, Nicholas Reeves, writing on the topic of British
film propaganda during World War I, contends that the “dominant approach of the
newsreels’ editorial comment as carried by the titles was factual and
restrained” <cite id="nm7s1"><a href="#zotero%7C22783102%2FWQEJMM77">(Reeves,
1986: 199–200)</a></cite>. From this perspective, the textual information in
early cinema had a limited rhetorical function, due to both media technological
conditions and established genre conventions. Recent scholarship, however, has
argued that long before sound, newsreels included intertitles that “served to
explain and ideologically tint the footage sandwiched between them” <cite
id="v3shd"><a href="#zotero%7C22783102%2FHDPCSX8E">(Scott, 2024:
34)</a></cite>. Hence, the rhetoric in the newsreel intertitles guided viewers’
opinions in advance of what appeared on screen—prior to the introduction of
sound.


If the audio dataset of the Journal Digital collection gives us (at least some)
insights about Swedish modernity, we argue that a dataset of some fifty
thousand intertitles has a similar potential to provide commentary on Swedish
society, touching on a wide variety of cultural, technological, and societal
issues. Since intertitles were inserted into newsreels one textual prompt at
a time, the number of words is not mind-boggling. Nevertheless, our dataset
still contains some 300,000 words. A key challenge when exploring these
newsreel intertitles using digital methods is the frequent occurrence of
mirrored flash intertitles. In film distribution and archiving, so-called flash
intertitles served a distinct purpose. Notably, flash intertitles do not appear
long enough to be readable. Rather, they originally served as position
markers on the negative, which in this context was used as a master copy for
producing new prints, and hence pointed out exactly where to insert
intertitles when producing a positive. Besides marking the position of the
intertitles, the flash intertitles could also indicate the intended textual
and visual design. Further, this practice has been commonplace in film
distribution and archiving primarily to “save on expensive film material” <cite
id="9vrve"><a href="#zotero%7C22783102%2FJQZ27MNN">(Dobringer et al., 2013:
130)</a></cite>. Then again, to many film historians mirrored flash intertitles
have always been a nuisance, not least since some formats for moving images
make it impossible to read them: on video it is hopeless, yet on an
editing table they can be read, and now also in digital format.
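
In digital form, both properties of a flash intertitle, its brevity and its
mirroring, can be handled programmatically. The sketch below is only an
illustration under stated assumptions: it presumes that some text detector has
already produced one true/false flag per frame (a hypothetical input), treats a
frame as a plain list of pixel rows, and uses an arbitrary 0.2-second cut-off
for what counts as too short to read.

```python
from itertools import groupby


def find_flash_runs(has_text, fps=25, max_seconds=0.2):
    """Return (first_frame, last_frame) runs of text too short to read.

    has_text: one boolean per frame, True where text was detected.
    max_seconds: hypothetical cut-off below which a run counts as a flash.
    """
    runs = []
    index = 0
    for flag, group in groupby(has_text):
        length = len(list(group))
        if flag and length / fps <= max_seconds:
            runs.append((index, index + length - 1))
        index += length
    return runs


def mirror_rows(frame):
    """Flip a frame left to right; `frame` is a list of pixel rows."""
    return [row[::-1] for row in frame]
```

In practice an imaging library would perform the horizontal flip on real image
data; the nested-list version only shows that un-mirroring amounts to reversing
each pixel row before the frame is passed on for text recognition.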

In [None]:
from IPython.display import Image, display

metadata={
    "jdh": {
        "module": "object",
        "object": {
            "tags": ["figure-military-*"],
            "type":"image",
            "source": [
                "It is almost impossible to read the low-resolution, "
                "mirrored flash intertitle from SF’s Weekly Review "
                "(15 April 1929), a day when prince Fredrik from Denmark "
                "arrived by train. The intertitle for the skiing competition "
                "in Oslo at Holmenkollen is easier to understand "
                "(at least, if you speak Swedish), taken from SF’s Weekly "
                "Review (26 February 1921). Mirrored flash intertitles are "
                "today still default for film historians using the Journal "
                "Digital collection at the National Library of Sweden; at "
                "filmarkivet.se, however, restored and high-resolution "
                "intertitles meet the viewer."
            ]
        }
    }
}


display(Image("./media/img9.png", width=800), metadata=metadata)

There is little previous research on newsreel intertitles using digital
methods. A notable exception is a study on the Jean Desmet collection
(1907–16), housed by the EYE Film Museum in the Netherlands. The authors show
the usefulness of deep learning methods to detect intertitles in audiovisual
corpora as markers of narrative structure <cite id="hqu0v"><a
href="#zotero%7C22783102%2FPUAS8ZHL">(Bhargav et al., 2019)</a></cite>. While
Bhargav and colleagues demonstrate the technical feasibility of using deep
learning to detect and analyse intertitles in early cinema, our study expands
the approach by developing a multimodal transcription pipeline for a much
larger corpus of Swedish newsreels spanning five decades. As briefly described
in our introduction, we developed a custom transcription tool, `stum`, and
deployed it to create individual .srt files for the 4,333 (`TODO: confirm
count`) films with intertitles. The texts total 302,342 (`TODO: confirm count`)
words. Notably, intertitles have a lower capacity for encoding text than a
voice narrator, but they remain a vital source of information about the content
of the films. Moreover, besides the metadata about the films, they are the only
way to find out what the newsreels were actually about. It should be stressed,
however, that given the substantial amount of cataloguing work that Gardar
Sahlberg (and other staff at SVT/SR) did during the 1960s and 1970s, the
metadata for most films within the SF-archive is rich.
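
To give a sense of what such an .srt file contains, the sketch below builds
SubRip blocks from frame ranges. It does not reproduce how `stum` works
internally; the function names and the input format are assumptions made for
the example, and the frame rate of 25 fps is taken from the digitised video
files.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(intertitles, fps=25):
    """Build SRT text from (first_frame, last_frame, text) tuples.

    The end time is taken as the start of the frame after `last_frame`,
    so that a one-frame intertitle still gets a non-zero duration.
    """
    blocks = []
    for index, (first, last, text) in enumerate(intertitles, start=1):
        start = srt_timestamp(first / fps)
        end = srt_timestamp((last + 1) / fps)
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)
```

Because each block carries its own start and end time, the resulting files can
be searched as plain text while still pointing back to the exact position of an
intertitle in the film.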

In [None]:
# Load intertitle stuff TODO: replace with library once the PRs have gone through.
!wget -nc https://zenodo.org/records/15596192/files/Modern36/journal_digital_corpus-2025.06.04.zip

In [None]:
"""
# skotks terrier https://modern36.github.io/jdc_reader/#q=skottsk+terrier&fuzzy=1&video=sf%2F1928%2FSF604A.1.mpg
# https://github.com/Modern36/journal_digital_corpus/blob/main/corpus/intertitle/sf/1928/SF604A.1.mpg.srt#L25-L28

// hermeneutics for the player

In this cell we initialise a small video player that loads the sequence
of .png files (frames 14540-14564 from 'SF604A.1.mpg').
It has controls for playback speed, pausing and looping, a scrubber for selecting
a specific frame, and an option to pause for 2 seconds at a specific frame.
It defaults to the original playback speed of the video file (25 fps) and pauses
at the mirrored flash-intertitle frame.
"""

import threading
import time
from pathlib import Path

import ipywidgets as widgets
from IPython.display import display


class PNGSequencePlayer:
    def __init__(self, file_paths, fps=25):
        """
        file_paths: A list of string paths to the .png files (sorted).
        fps: Default playback speed.
        """
        # 1. Pre-load all PNG bytes into memory for zero-latency playback
        self.frames_data = []
        for p in file_paths:
            with open(p, 'rb') as f:
                self.frames_data.append(f.read())
        if not self.frames_data:
            # Fail early: the widgets below need at least one frame
            raise ValueError("No PNG frames found; check the input paths.")

        self.n_frames = len(self.frames_data)
        self.is_paused_event = False

        # --- UI Components ---

        # We tell the widget these are PNGs
        self.image_widget = widgets.Image(
            value=self.frames_data[0],
            format='png',
            layout=widgets.Layout(width='auto', max_width='250px')
        )

        # Animation Controller
        self.play_widget = widgets.Play(
            value=0, min=0, max=self.n_frames - 1,
            step=1, interval=int(1000/fps),
            description="Press play", repeat=False
        )

        # Scrubber
        self.slider = widgets.IntSlider(
            value=0, min=0, max=self.n_frames - 1, description="Frame"
        )

        # Controls
        self.fps_input = widgets.BoundedIntText(
            value=fps, min=1, max=120, step=1, description='FPS:',
            layout=widgets.Layout(width='140px')
        )

        self.loop_box = widgets.Checkbox(value=False, description="Loop")

        # Pause Logic UI
        self.pause_enable = widgets.Checkbox(value=True, description="Wait 2s @ Frame:")
        self.pause_idx = widgets.BoundedIntText(
            value=12, min=0, max=self.n_frames-1, description='',
            layout=widgets.Layout(width='60px')
        )

        # --- Logic Wiring ---

        widgets.jslink((self.play_widget, 'value'), (self.slider, 'value'))

        self.slider.observe(self.on_frame_change, names='value')
        self.fps_input.observe(self.update_speed, names='value')
        self.loop_box.observe(self.update_loop, names='value')

        # --- Layout ---

        # Group the specific pause controls
        pause_group = widgets.HBox([self.pause_enable, self.pause_idx])

        controls = widgets.VBox([
            widgets.HBox([self.play_widget, self.slider]),
            widgets.HBox([self.fps_input, self.loop_box, pause_group])
        ])

        self.ui = widgets.VBox([self.image_widget, controls])

    def on_frame_change(self, change):
        frame_idx = change['new']

        # DIRECT BYTE SWAP - Extremely fast
        self.image_widget.value = self.frames_data[frame_idx]

        # Check for "Magic Pause"
        if (self.play_widget.playing and
            self.pause_enable.value and
            frame_idx == self.pause_idx.value and
            not self.is_paused_event):

            self.trigger_pause()

    def trigger_pause(self):
        """Stops animation, waits 2s in background thread, resumes."""
        self.is_paused_event = True
        self.play_widget.playing = False # Stop

        def resume_worker():
            time.sleep(2)
            self.play_widget.playing = True # Resume
            # Tiny buffer to ensure we don't re-trigger on the same millisecond
            time.sleep(0.2)
            self.is_paused_event = False

        threading.Thread(target=resume_worker).start()

    def update_speed(self, change):
        if change['new'] > 0:
            self.play_widget.interval = int(1000 / change['new'])

    def update_loop(self, change):
        self.play_widget.repeat = change['new']

    def show(self):
        display(self.ui)
terrier = PNGSequencePlayer(sorted(Path('./1s/').glob('*.png')))


In [None]:
"""
Blink and you will miss it.

For the video below we have cut 25 frames (1 s) from `SF604A.1.mpg`. The clip
showcases several of the technical obstacles of working with this archival footage:

Firstly, the video compression is very lossy, which produces many blocky artefacts.
Secondly, the flash intertitle is fully visible in only a single frame (frame 12),
which makes it impossible for a viewer to read and very likely
that they will miss it entirely.
Thirdly, the intertitle is mirrored left to right, which makes it more difficult to read, especially for OCR software.

"""

terrier.show()
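
Two of the obstacles above lend themselves to a quick illustration. The sketch below is not part of `stum`: it computes how long a single frame stays on screen at a given frame rate, and it un-mirrors a frame by flipping it left to right, as one might do before attempting OCR. The names `frame_duration_ms` and `unmirror` are hypothetical helpers introduced only for this example, and the tiny array stands in for a real frame.

```python
import numpy as np


def frame_duration_ms(fps: float) -> float:
    """Time in milliseconds that one frame is shown at the given frame rate."""
    return 1000.0 / fps


def unmirror(frame: np.ndarray) -> np.ndarray:
    """Flip a frame left to right (rows stay in place, columns are reversed)."""
    return frame[:, ::-1]


# A 2x3 stand-in "frame"; real frames are much larger arrays of pixel values.
toy_frame = np.array([[1, 2, 3],
                      [4, 5, 6]], dtype=np.uint8)

print(frame_duration_ms(25))   # duration of one frame at 25 fps
print(unmirror(toy_frame))
```

At 25 frames per second a single frame is on screen for 40 ms, which is why the flash intertitle is so easy to miss.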

In [None]:
"""
Because `flash intertitles` are so prevalent, we need to process every single
frame. However, running every frame through an intertitle filter and a
subsequent OCR engine is very inefficient.
We therefore start by grouping the frames by sequential similarity: consecutive
frames that resemble each other are grouped together, and only the middle image
of each group is passed through the two-step intertitle filter and then on to
the OCR engine.

The default threshold `stum` uses to declare two frames 'different' enough
is a Mean Square Error of 10,000. This is a completely arbitrary number
that seems to work well for Journal Digital, but it might need to be adjusted
before it can be successfully applied to other collections.

TODO: add reference to original file for the MSE stuff.
"""
### Mean Square Error -- TODO

MSE_THRESHOLD = 10_000

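def mse(im1, im2):
    """Mean squared error between two equally sized images.

    The squared differences are summed over all channels but divided only by
    the number of pixels (height * width).
    """
    err = np.sum((im1.astype("float") - im2.astype("float")) ** 2)
    err /= float(im1.shape[0] * im1.shape[1])
    return err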

def detect_scene_change(im1, im2):
    score = mse(im1, im2)
    return score > MSE_THRESHOLD
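
The grouping step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code `stum` runs: the helper names `group_by_similarity`, `middle_of_each_group` and `flat` are hypothetical, the frames are synthetic single-colour arrays, and picking index `len(group) // 2` as the "middle" is our assumption about how even-sized groups are handled.

```python
import numpy as np

MSE_THRESHOLD = 10_000  # same default as above


def mse(im1, im2):
    """Squared error summed over all channels, divided by the pixel count."""
    err = np.sum((im1.astype("float") - im2.astype("float")) ** 2)
    return err / float(im1.shape[0] * im1.shape[1])


def group_by_similarity(frames, threshold=MSE_THRESHOLD):
    """Group consecutive frame indices; start a new group when MSE > threshold."""
    groups = []
    for idx, frame in enumerate(frames):
        if idx > 0 and mse(frames[idx - 1], frame) <= threshold:
            groups[-1].append(idx)
        else:
            groups.append([idx])
    return groups


def middle_of_each_group(groups):
    """Pick one representative index per group (assumed: the middle element)."""
    return [group[len(group) // 2] for group in groups]


def flat(value, shape=(4, 4, 3)):
    """Synthetic frame where every pixel has the same value."""
    return np.full(shape, value, dtype=np.uint8)


# Three dark frames, one bright flash frame, two mid-grey frames
frames = [flat(0), flat(5), flat(0), flat(255), flat(100), flat(100)]
groups = group_by_similarity(frames)
print(groups)
print(middle_of_each_group(groups))
```

With these synthetic frames, only three of the six frames would be sent on to the filters and the OCR engine.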


In [None]:
"""
Here we load a short selection of prepared .png images and pass them
through the MSE threshold test to show how it is able to detect
the scene -> intertitle changes.

media/terrier_frames/frame_14550.png
ends one scene

media/terrier_frames/frame_14551.png media/terrier_frames/frame_14552.png False
form a 'scene' together (no change is detected between them)

media/terrier_frames/frame_14553.png
is on its own

media/terrier_frames/frame_14554.png
starts a new scene


`stum` then uses the _middle_ image of each grouping as the example image,
which avoids passing every single image through the filters:

1. contours
2. EAST Text detection
3. Tesseract OCR



"""
terrier_dir = Path('.') / 'media' / 'terrier_frames'
images = sorted(terrier_dir.iterdir())
for path1, path2 in pairwise(images):
    im1 = cv2.imread(str(path1))
    im2 = cv2.imread(str(path2))
    print(path1, path2, detect_scene_change(im1, im2))


In [None]:
# Contours `stum/src/stum/contours.py`

"""
Below is an excerpt from `stum/src/stum/contours.py` as of (date TODO)
showing `stum`'s `contour filter`.
It first converts the incoming image to grayscale, reducing the number of colour channels from three to one.
All our input images already look grayscale, but they are not all stored with a single colour channel, so this step is necessary to homogenize the images
for the upcoming processing.

The contour filter then converts the grayscale image to binary,
mapping the value of each pixel from the range 0 to 255 to either
0 (black) or 255 (white).

It then finds the largest white contour in the binary image and calculates its size
relative to the entire image. If the contour is larger than the
threshold (default set to 0.9), the function flags the frame as a potential intertitle.
If the largest contour is instead very small (below the complement of the threshold), there is a chance that
the frame is an intertitle whose text and background have the opposite polarity, so the binary image
is inverted and checked again.

All the heavy lifting in the image processing is handled by OpenCV.
"""


def largest_contour(binary_image: cv2.typing.MatLike):
    """Returns the relative area of the largest contour of the image"""
    contours = cv2.findContours(
        binary_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )[0]
    largest = cv2.contourArea(max(contours, key=cv2.contourArea))

    # NumPy image arrays are indexed (rows, columns), i.e. (height, width)
    height, width = binary_image.shape
    total_area = width * height
    relative_area = largest / total_area

    return relative_area

def contour_filter(image: cv2.typing.MatLike, threshold=0.9) -> bool:
    """Check if image has one large contour

    If the largest contour is smaller than the complement to the threshold,
    it also calculates the largest contour of the inverted image. This is a
    way to check for images with dark backgrounds and white text.

    Parameters
        image: cv2 image to check
        threshold: threshold to check contour area against, default is 90%

    Returns
        True if image has one contour larger than given threshold
    """
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    binary = cv2.threshold(
        gray, 100, 255, cv2.THRESH_OTSU | cv2.THRESH_BINARY_INV
    )[1]

    relative_area = largest_contour(binary)

    if (1 - relative_area) > threshold:
        inverted = cv2.bitwise_not(binary)

        inverteds_largest_area = largest_contour(inverted)

        relative_area = max(relative_area, inverteds_largest_area)

    return relative_area > threshold
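
To see why the inverted image is checked as well, the following sketch reduces the idea to plain NumPy. It is a simplified stand-in and not the filter above: it uses a fixed cutoff of 100 instead of Otsu's method, and it measures the fraction of white pixels rather than the area of the largest contour. The names `binarize_inv`, `white_fraction`, `looks_like_card` and `card` are hypothetical, and the frames are synthetic.

```python
import numpy as np


def binarize_inv(gray, cutoff=100):
    """Inverse binary threshold: pixels above the cutoff become 0, the rest 255."""
    return np.where(gray > cutoff, 0, 255).astype(np.uint8)


def white_fraction(binary):
    """Share of pixels that are white (non-zero) in a binary image."""
    return np.count_nonzero(binary) / binary.size


def looks_like_card(gray, threshold=0.9):
    """Simplified analogue of the contour filter using pixel fractions."""
    binary = binarize_inv(gray)
    fraction = white_fraction(binary)
    if (1 - fraction) > threshold:
        # Almost nothing is white: try the opposite polarity
        fraction = max(fraction, white_fraction(255 - binary))
    return fraction > threshold


def card(background, text):
    """Synthetic 100x200 frame with a small 10x40 'text' block."""
    frame = np.full((100, 200), background, dtype=np.uint8)
    frame[45:55, 80:120] = text
    return frame


dark_card = card(background=0, text=255)    # light text on dark background
light_card = card(background=255, text=0)   # dark text on light background
half_half = np.zeros((100, 200), dtype=np.uint8)
half_half[:, 100:] = 255                     # ordinary high-contrast scene

for name, frame in [("dark", dark_card), ("light", light_card), ("half", half_half)]:
    print(name, looks_like_card(frame))
```

Both synthetic cards pass regardless of polarity, while the frame split evenly between black and white does not.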


In [None]:
""" TODO: Update to use the native `Corpus` object loading the intertitles.

In preparation for the distant reading of the intertitles, we extract the
subdirectory holding the .srt files from the downloaded
.zip file of the Journal Digital Corpus.


"""
corpus_path = Path('.') / 'corpus'

if not corpus_path.exists():
    with zipfile.ZipFile('journal_digital_corpus-2025.06.04.zip') as zip_f:
        corpus_name = [_ for _ in zip_f.namelist() if '/corpus/' in _]
        print(corpus_name)
        for target in corpus_name:
            zip_f.extract(target, 'tmp')
        !mv tmp/Modern36-journal_digital_corpus-c1e6cdf/corpus corpus/
        !rmdir tmp/*
        !rmdir tmp

In [None]:
"""
TODO: Refactor to use journal_digital.Corpus
"""
def load_intertitles():
    for srt in corpus_path.glob('intertitle/sf/**/*.srt'):
        year = srt.parent.name
        if not year.isdigit():
            continue
        year = int(year)
        if year < 1900:
            continue
        with open(srt, 'r', encoding='utf8') as f:
            content = f.read()
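        # Assumes each SRT block is four lines: index, timestamp, text, blank
        intertitles = [line for i, line in enumerate(content.split('\n')) if i % 4 == 2]
        if not intertitles:
            # Skip empty files to avoid dividing by zero below
            continue
        words = len([word for intertitle in intertitles for word in intertitle.split()])
        words_per_intertitle = words / len(intertitles)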

        yield {
            'file' : srt.name,
            'year': year,
            'num_intertitles' : len(intertitles),
            'num_words' : words,
            'words_per_intertitle': words_per_intertitle
        }

df = pd.DataFrame(load_intertitles())
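
The loader above picks the third line of every four-line block, which works when each subtitle block has exactly one line of text. As a hedged alternative, the sketch below splits an .srt string into blocks on blank lines and joins any text lines within a block. The helper `parse_srt_text` and the sample content are invented for illustration; we have not verified whether the corpus contains multi-line blocks.

```python
import re


def parse_srt_text(content: str) -> list[str]:
    """Return the text of each SRT block, joining multi-line text with spaces."""
    texts = []
    # Blocks are separated by one or more blank lines
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        # Line 0 is the running index, line 1 the timestamps, the rest is text
        text = " ".join(line.strip() for line in lines[2:] if line.strip())
        if text:
            texts.append(text)
    return texts


# Invented sample, not taken from the corpus
sample = """1
00:00:01,000 --> 00:00:03,000
STOCKHOLM

2
00:00:10,000 --> 00:00:12,000
FIRST LINE
SECOND LINE
"""

intertitles = parse_srt_text(sample)
num_words = sum(len(text.split()) for text in intertitles)
print(intertitles)
print(num_words / len(intertitles))
```

The trade-off is a slightly slower parse in exchange for not depending on a fixed block length.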

In [None]:
"""
Bootstrapping confidence interval for visualization.
"""
def bootstrap_confidence_interval(data, n_bootstrap=1000, ci=95):
    """
    Calculates the mean and bootstrapped confidence interval for a dataset.
    """
    bootstrap_means = np.zeros(n_bootstrap)

    for i in range(n_bootstrap):
        # Create a resample of the data (sampling with replacement)
        bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
        bootstrap_means[i] = np.mean(bootstrap_sample)

    # Calculate the lower and upper bounds of the confidence interval
    lower_bound = np.percentile(bootstrap_means, (100 - ci) / 2)
    upper_bound = np.percentile(bootstrap_means, 100 - (100 - ci) / 2)

    # Return the statistics in a pandas Series for easy aggregation
    return pd.Series({
        'mean': np.mean(data),
        'ci_lower': lower_bound,
        'ci_upper': upper_bound
    })
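
Because the resampling is random, the interval bounds change slightly on every run. A minimal sketch of a reproducible, vectorized variant follows, assuming a fixed seed is acceptable for the figures. The function name `bootstrap_ci_seeded`, the seed and the sample values are illustrative choices and not part of the notebook.

```python
import numpy as np


def bootstrap_ci_seeded(data, n_bootstrap=1000, ci=95, seed=0):
    """Percentile bootstrap CI of the mean, drawn with a seeded generator."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # One row per resample, each row sampled with replacement from data
    resamples = rng.choice(data, size=(n_bootstrap, len(data)), replace=True)
    means = resamples.mean(axis=1)
    lower = np.percentile(means, (100 - ci) / 2)
    upper = np.percentile(means, 100 - (100 - ci) / 2)
    return data.mean(), lower, upper


# Made-up word counts for a handful of films
mean, lower, upper = bootstrap_ci_seeded([12, 30, 18, 25, 40, 22])
print(mean, lower, upper)
```

Calling the function twice with the same seed returns identical bounds, which keeps the plotted bands stable between notebook runs.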


In [None]:
def plot_with_ci(col, title):
    summary_df = df.groupby('year')[col].apply(bootstrap_confidence_interval).unstack().reset_index()

    # Create the figure with continuous error bands
    fig = go.Figure([
        # The mean line
        go.Scatter(
            x=summary_df['year'],
            y=summary_df['mean'],
            line=dict(color='rgb(0,100,80)'),
            mode='lines',
            name='Mean'
        ),
        # The confidence interval error band
        go.Scatter(
            x=np.concatenate([summary_df['year'], summary_df['year'][::-1]]), # x, then x reversed
            y=np.concatenate([summary_df['ci_upper'], summary_df['ci_lower'][::-1]]), # upper, then lower reversed
            fill='toself',
            fillcolor='rgba(0,100,80,0.2)',
            line=dict(color='rgba(255,255,255,0)'),
            hoverinfo="skip",
            showlegend=True,
            name='95% Confidence Interval'
        )
    ])

    fig.update_layout(
        title=title,
        xaxis_title='Year',
        yaxis_title=f'Mean of {col} (band: 95% bootstrap CI)',
        showlegend=False  # hides the legend for all traces
    )
    return fig


In [None]:
plot_with_ci('num_words', title='Average number of words per video')

In [None]:
plot_with_ci('words_per_intertitle', title='Changes in the average number of words per intertitle')

In [None]:
# Wordcloud
"""
TODO: refactor loading to use journal_digital.Corpus

In this step we count all the words from the intertitles in order to visualise
the most commonly occurring words in a wordcloud.

Before we count them, we filter out short words (fewer than 4 characters),
numbers, and words that are just repetitions of a single character.

The counting is performed by scikit-learn's CountVectorizer, which is
both optimized and accepts a stopword list.

Since we are working with a primarily Swedish corpus, we rely
on a stopword list published under an MIT license by the 'stopwords-iso' project on GitHub:
https://github.com/stopwords-iso/stopwords-iso/tree/master

The counted words are then passed through WordCloud from the wordcloud library and visualized with Plotly.


TODO: Add ref to stopwords-iso for the stopwordslist
"""

def load_intertitle_texts():
    for srt in corpus_path.glob('intertitle/sf/**/*.srt'):
        year = srt.parent.name
        if not year.isdigit():
            continue
        year = int(year)
        if year < 1900:
            continue
        with open(srt, 'r', encoding='utf8') as f:
            content = f.read()
        intertitles = [line for i, line in enumerate(content.split('\n')) if i % 4 == 2]
        yield ' '.join(
            word
            for intertitle in intertitles
            for word in intertitle.split()
            # Drop short words, pure numbers, and single-character repetitions
            if len(word) > 3 and not word.isdigit() and len(set(word)) != 1
        )

texts = list(load_intertitle_texts())

stopwords_file = Path('.') / 'stopwords-sv.txt'
if not stopwords_file.exists():
    !wget https://raw.githubusercontent.com/stopwords-iso/stopwords-sv/master/stopwords-sv.txt

stopwords = stopwords_file.read_text(encoding="utf-8").splitlines()

vec = CountVectorizer(
    stop_words=stopwords,
    max_features=300
)

matrix = vec.fit_transform(texts)
counts = matrix.sum(axis=0).A1
freq_dict = dict(zip(vec.get_feature_names_out(), counts))
sorted_freqs = sorted(freq_dict.items(), key=lambda item: item[1], reverse=True)


def update_cloud(num_words, scale):
    top_n_dict = dict(sorted_freqs[:num_words])

    wc = WordCloud(
        width=800,
        height=400,
        background_color='white',
        scale=scale # Controls resolution
    ).generate_from_frequencies(top_n_dict)

    fig = px.imshow(wc)
    fig.update_layout(
        xaxis={'visible': False},
        yaxis={'visible': False},
        margin={'t': 0, 'b': 0, 'l': 0, 'r': 0}
    )
    fig.show()
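
The filtering and counting described above can be illustrated on a toy example. The sketch below uses `collections.Counter` as a plain-Python stand-in for `CountVectorizer`, which additionally applies its own tokenisation. The helper names, the sample titles and the three-word stopword set are invented for illustration.

```python
from collections import Counter


def keep_word(word: str) -> bool:
    """Same three rules as above: length > 3, not a number, not one repeated character."""
    return len(word) > 3 and not word.isdigit() and len(set(word)) != 1


def count_words(texts, stopwords):
    """Count lower-cased words that pass the filter and are not stopwords."""
    counts = Counter()
    for text in texts:
        for word in text.split():
            if keep_word(word) and word.lower() not in stopwords:
                counts[word.lower()] += 1
    return counts


# Invented sample titles and a tiny stopword set
sample_texts = [
    "Kungen besöker Stockholm 1935",
    "Stockholm från luften ----",
    "Kungen och folket",
]
sample_stopwords = {"från", "och", "att"}

print(count_words(sample_texts, sample_stopwords).most_common(2))
```

Here "1935" is dropped as a number, "----" as a repeated character, "och" as too short, and "från" as a stopword.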

In [None]:
"""
Explanation?
Interpretation?
"""
w_slider = widgets.IntSlider(value=10, min=1, max=150, step=1, description='Words:', continuous_update=False)
scale_slider = widgets.FloatSlider(value=1.0, min=1.0, max=3.0, step=0.5, description='Resolution:')

widgets.interactive(update_cloud, num_words=w_slider, scale=scale_slider)

In [None]:
# The city vs. the village

In [None]:
# Load NER and GEO stuff

# Bibliography


<div class="cite2c-biblio"></div>