# Dutch Corpora

Before running this notebook, install the `tempo-embeddings` package from your local clone of the Git repository:

```bash
pip install .
```

See [README.dev.md](../README.dev.md) for more details.

Alternatively, the following cell installs the current version directly from GitHub, without requiring a local clone:

In [1]:
%pip install --upgrade pip  # Required for properly resolving dependencies

# The URL below can also refer to a specific version or branch
%pip install --upgrade git+https://github.com/Semantics-of-Sustainability/tempo-embeddings.git


Note: you may need to restart the kernel to use updated packages.
Collecting git+https://github.com/Semantics-of-Sustainability/tempo-embeddings.git
  Cloning https://github.com/Semantics-of-Sustainability/tempo-embeddings.git to /private/var/folders/d8/j5_fyf8941j_492zvf8948y40000gn/T/pip-req-build-gncj2177
  Running command git clone --filter=blob:none --quiet https://github.com/Semantics-of-Sustainability/tempo-embeddings.git /private/var/folders/d8/j5_fyf8941j_492zvf8948y40000gn/T/pip-req-build-gncj2177
  Resolved https://github.com/Semantics-of-Sustainability/tempo-embeddings.git to commit 2f37ecb33ba20a9360f27369b91e2c6453a7cf12
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: tempo-embeddings
  Building wheel for tempo-embeddings (pyproject.toml) ... done
  Created wheel for tempo-embeddings: filename=tempo_embedd

In [17]:
%load_ext autoreload

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [18]:
# Detect whether the notebook is running in Google Colab
try:
    import google.colab  # noqa: F401

    IN_COLAB = True
except ImportError:
    IN_COLAB = False

## Load Model

In [19]:
%autoreload now

LAYER=12

from tempo_embeddings.embeddings.model import (
    RobertaModelWrapper,
    TransformerModelWrapper,
    XModModelWrapper,
)

kwargs = {"accelerate": True}

# MODEL_NAME = "DTAI-KULeuven/robbertje-1-gb-non-shuffled"
# model_class = RobertaModelWrapper

MODEL_NAME = "facebook/xmod-base"
kwargs["default_language"] = "nl_XX"
model_class = XModModelWrapper


# MODEL_NAME = "xlm-roberta-base"
# MODEL_NAME = "xlm-mlm-100-1280"
# model_class = TransformerModelWrapper

In [20]:
model = model_class.from_pretrained(MODEL_NAME, layer=LAYER, **kwargs)

Some weights of XmodModel were not initialized from the model checkpoint at facebook/xmod-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Load Data

Download the data and make it available in the directory configured in the cells below.

NOTE: You need to adapt `DATA_DIR` manually to match your own setup.

In [21]:
%autoreload now

import operator
from functools import reduce
from pathlib import Path
from tqdm import tqdm
from tempo_embeddings.text.corpus import Corpus

In [22]:
WINDOW_SIZE = 200

RANDOM_SAMPLE_ANP = 0
RANDOM_SAMPLE_STATEN_GENERAAL = 200

STATEN_GENERAAL_BLACKLIST = ["1987"]

FILTER_TERMS = ["duurzaam"]  # Search term(s) for filtering the corpus

In [23]:
### NOTE: Adapt the `DATA_DIR` below manually!
### For a shared Google Drive, create a shortcut into your own Google Drive
### See https://stackoverflow.com/questions/54351852/accessing-shared-with-me-with-colab

LOCAL_PATHS: list[Path] = [
    Path.home() / "Documents" / "SemanticsOfSustainability" / "data" / "Joris",
    Path("/home/cschnober/data/"),  # Snellius
]

if IN_COLAB:
    from google.colab import drive

    drive.mount("/content/drive")

    DATA_DIR = Path("/content/drive/MyDrive/Data/")
else:
    try:
        DATA_DIR = next(path for path in LOCAL_PATHS if path.is_dir())
    except StopIteration as e:
        raise FileNotFoundError("Data directory not found.") from e

assert DATA_DIR.is_dir(), f"Data dir '{DATA_DIR}' not found."
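
The lookup for local runs can also be wrapped in a small helper. The following is a minimal sketch and not part of `tempo-embeddings`: the function name `find_data_dir` and the environment variable `TEMPO_DATA_DIR` are hypothetical names introduced here for illustration only.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def find_data_dir(candidates: Iterable[Path], env_var: str = "TEMPO_DATA_DIR") -> Path:
    """Return the first existing directory; a path in `env_var` takes precedence.

    Both the function and the environment variable name are hypothetical.
    """
    override: Optional[str] = os.environ.get(env_var)
    paths = ([Path(override)] if override else []) + list(candidates)
    for path in paths:
        if path.is_dir():
            return path
    raise FileNotFoundError(
        f"No data directory found among: {[str(path) for path in paths]}"
    )


# Usage: a temporary directory stands in for the real data location
with tempfile.TemporaryDirectory() as tmp:
    found = find_data_dir([Path("/nonexistent/hypothetical"), Path(tmp)])
    print(found)
```

Listing the tried paths in the error message makes it easier to see which entry needs adapting.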

### ANP

In [24]:
ANP_DIR = DATA_DIR / "ANP" / "CleanFiles_perYear"
assert RANDOM_SAMPLE_ANP == 0 or ANP_DIR.is_dir()

In [25]:
import random

random.seed(0)

files = random.sample(list(ANP_DIR.glob("ANP_????.csv.gz")), k=RANDOM_SAMPLE_ANP)
files[:10]

[]

In [26]:
anp_corpus = (
    reduce(
        operator.add,
        (
            Corpus.from_csv_file(
                path,
                model=model,
                filter_terms=FILTER_TERMS,
                text_columns=["content"],
                encoding="iso8859_15",
                compression="gzip",
                delimiter=";",
                window_size=WINDOW_SIZE,
            )
            for path in tqdm(files, unit="file")
        ),
    )
    if files
    else Corpus(model=model)
)

len(anp_corpus)

0
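
The `if files else Corpus(model=model)` fallback is needed because `reduce()` without a start value raises a `TypeError` on an empty sequence. As a sketch of an alternative, `reduce()` accepts a start value as its third argument. The `ToyCorpus` class below is a hypothetical stand-in for `Corpus`; whether `Corpus` itself behaves well with an empty corpus as the left operand is an assumption to verify.

```python
import operator
from functools import reduce


class ToyCorpus:
    """Hypothetical stand-in for Corpus: holds passages and supports `+`."""

    def __init__(self, passages=None):
        self.passages = list(passages or [])

    def __add__(self, other):
        return ToyCorpus(self.passages + other.passages)

    def __len__(self):
        return len(self.passages)


def merge(corpora):
    # The third argument is the start value; it is returned as-is for empty input
    return reduce(operator.add, corpora, ToyCorpus())


print(len(merge([])))  # 0
print(len(merge([ToyCorpus(["a"]), ToyCorpus(["b", "c"])])))  # 3
```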

### Staten Generaal

In [27]:
STATEN_GENERAAL_DIR = DATA_DIR / "StatenGeneraal"

assert RANDOM_SAMPLE_STATEN_GENERAAL == 0 or STATEN_GENERAAL_DIR.is_dir()

In [93]:
### Load random files:
# files = random.sample(
#     list(STATEN_GENERAAL_DIR.glob("StatenGeneraal_????.csv.gz")),
#     k=RANDOM_SAMPLE_STATEN_GENERAAL,
# )

glob195x = "StatenGeneraal_19[5-9]?.csv.gz"  # Pattern for files from 1950-1999
glob20xx = "StatenGeneraal_2???.csv.gz"  # Pattern for files from 2000 onwards

files_195x = list(STATEN_GENERAAL_DIR.glob(glob195x))
files_20xx = list(STATEN_GENERAAL_DIR.glob(glob20xx))

files = [
    file
    # Merge files from patterns
    for file in files_20xx + files_195x
    # Remove blacklisted files:
    if not any(
        blacklisted in file.name for blacklisted in STATEN_GENERAAL_BLACKLIST
    )
]

sorted(files)

[PosixPath('/Users/carstenschnober/Documents/SemanticsOfSustainability/data/Joris/StatenGeneraal/StatenGeneraal_1950.csv.gz'),
 PosixPath('/Users/carstenschnober/Documents/SemanticsOfSustainability/data/Joris/StatenGeneraal/StatenGeneraal_1951.csv.gz'),
 PosixPath('/Users/carstenschnober/Documents/SemanticsOfSustainability/data/Joris/StatenGeneraal/StatenGeneraal_1952.csv.gz'),
 PosixPath('/Users/carstenschnober/Documents/SemanticsOfSustainability/data/Joris/StatenGeneraal/StatenGeneraal_1953.csv.gz'),
 PosixPath('/Users/carstenschnober/Documents/SemanticsOfSustainability/data/Joris/StatenGeneraal/StatenGeneraal_1954.csv.gz'),
 PosixPath('/Users/carstenschnober/Documents/SemanticsOfSustainability/data/Joris/StatenGeneraal/StatenGeneraal_1955.csv.gz'),
 PosixPath('/Users/carstenschnober/Documents/SemanticsOfSustainability/data/Joris/StatenGeneraal/StatenGeneraal_1956.csv.gz'),
 PosixPath('/Users/carstenschnober/Documents/SemanticsOfSustainability/data/Joris/StatenGeneraal/StatenGeneraal
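
To check which file names the two patterns and the blacklist select without touching the file system, the same logic can be tried on plain strings. This is a minimal sketch using `fnmatch` from the standard library, which follows the same shell-style wildcard rules for these patterns; the helper `is_selected` and the sample file names are made up for illustration.

```python
from fnmatch import fnmatch

PATTERNS = ["StatenGeneraal_19[5-9]?.csv.gz", "StatenGeneraal_2???.csv.gz"]
BLACKLIST = ["1987"]


def is_selected(name: str) -> bool:
    """True if the name matches any pattern and contains no blacklisted term."""
    matches = any(fnmatch(name, pattern) for pattern in PATTERNS)
    blacklisted = any(term in name for term in BLACKLIST)
    return matches and not blacklisted


# Hypothetical file names covering the edge cases
names = [f"StatenGeneraal_{year}.csv.gz" for year in (1949, 1950, 1987, 1999, 2003)]
selected = [name for name in names if is_selected(name)]
print(selected)
```

Only the files for 1950, 1999 and 2003 remain: 1949 matches neither pattern, and 1987 is blacklisted.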

In [31]:
%autoreload now

import csv

# Raise the CSV field size limit to allow for very long text fields
csv.field_size_limit(100_000_000)

sg_corpus = (
    reduce(
        operator.add,
        (
            Corpus.from_csv_file(
                path,
                model=model,
                filter_terms=FILTER_TERMS,
                text_columns=["Content"],
                encoding="utf-8",
                compression="gzip",
                delimiter=";",
                window_size=WINDOW_SIZE,
            )
            for path in tqdm(files, unit="file")
        ),
    )
    if files
    else Corpus(model=model)
)

len(sg_corpus)

100%|██████████| 68/68 [01:09<00:00,  1.01s/file]


25652

In [32]:
for p in sg_corpus.passages[:20]:
    print(len(p), p)

200 Passage('geweest van een duurzaam proces van versterking van de positie van de provincie. Het wetsvoorstel wil daarin dan ook geen wijziging brengen. Dat lijkt ons op dit ogenblik een goed uitgangspunt. Het is', {'': '109', 'RecId': 'h-ek-20022003-421-456', 'chamber': 'EersteKamer', 'date': '2003-01-14', 'speakers': 'De heer Witteveen PvdA|Mevrouw Meindertsma PvdA|De heer Platvoet GroenLinks|De heer Dölle CDA|De heer Witteveen PvdA|De heer Dölle CDA|De heer Holdijk SGP|De heer Platvoet GroenLinks|De heer Holdijk SGP|De heer Platvoet GroenLinks|De heer Holdijk SGP|De heer Bierman OSF|De heer Terlouw D66|De heer Bierman OSF|De heer Terlouw D66|De heer Platvoet GroenLinks|De heer Terlouw D66|De heer Dölle CDA|De heer Terlouw D66|De heer Luijten VVD|Minister Remkes|De heer Witteveen PvdA|Minister Remkes|De heer Witteveen PvdA|Minister Remkes|De heer Platvoet GroenLinks|Minister Remkes|Mevrouw Meindertsma PvdA|Minister Remkes|De heer Terlouw D66|Minister Remkes|De heer Platvoet GroenLin

### Merge

In [33]:
corpus = anp_corpus + sg_corpus
len(corpus)

25652

## Compute Embeddings

In [34]:
from tempo_embeddings.embeddings.model import EmbeddingsMethod

model.batch_size = 32
model.embeddings_method = EmbeddingsMethod.CLS

corpus.compute_embeddings()

Embeddings: 100%|██████████| 802/802 [03:55<00:00,  3.41batch/s]
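
The number of batches in the progress bar can be checked against the corpus size. The sketch below assumes that `compute_embeddings()` processes at most `model.batch_size` passages per batch; the two input numbers are taken from the outputs above.

```python
import math

n_passages = 25652  # len(corpus), as reported after merging
batch_size = 32  # model.batch_size

n_batches = math.ceil(n_passages / batch_size)
# Size of the final, partially filled batch
last_batch = n_passages - (n_batches - 1) * batch_size

print(n_batches, last_batch)  # 802 20
```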


## Read Stopwords

In [35]:
!wget --continue https://raw.githubusercontent.com/Semantics-of-Sustainability/tempo-embeddings/main/tempo_embeddings/data/stopwords-filter-nl.txt

stopwords_file = Path("stopwords-filter-nl.txt")

with stopwords_file.open("rt", encoding="utf-8") as f:
    stopwords = set(f.read().splitlines())

stopwords.update({"wij", "we", "moeten"})

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
--2023-09-08 14:27:04--  https://raw.githubusercontent.com/Semantics-of-Sustainability/tempo-embeddings/main/tempo_embeddings/data/stopwords-filter-nl.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 416 Range Not Satisfiable

    The file is already fully retrieved; nothing to do.



## Cluster

In [89]:
%autoreload now

from tempo_embeddings.text.cluster import Cluster

cluster = Cluster(corpus, vectorizer=None, n_topic_words=2)


In [90]:
cluster.cluster(stopwords=stopwords, min_samples=5, cluster_selection_epsilon=0.0)

# Arguments: min_cluster_size=10, cluster_selection_epsilon=0.1, ...
# See https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html for full list



['duurzaam; goed',
 'duurzaam; duurzaamheid',
 'Outliers',
 'duurzaam; kamer',
 'duurzaamheid; duurzaam',
 'duurzaam; duurzaamheid',
 'duurzaam; arbeidsongeschikt',
 'duurzaam; volledig',
 'duurzaam; duurzaamheid',
 'duurzaamheid; duurzaam',
 'duurzaam; alle',
 'duurzaamheid; duurzaam',
 'duurzaam; duurzaamheid',
 'duurzaam; heer',
 'heel; duurzaam',
 'duurzaamheid; duurzaam',
 'duurzaam; mensen',
 'duurzaam; duurzaamheid',
 'duurzaamheid; duurzaam',
 'duurzaamheid; duurzaam',
 'hout); geproduceerd',
 'duurzaamheid; sociale',
 'duurzaamheid; duurzaam',
 'duurzaam; energie',
 'duurzaam; nederland',
 'duurzaam; energie',
 'duurzaam; duurzaamheid',
 'duurzaam; duurzaamheid',
 'duurzaam; duurzaamheid',
 'duurzaam; duurzaamheid',
 'duurzaam; energie',
 'duurzaam; heer',
 'duurzaamheid; mevrouw',
 'duurzaam; duurzaamheid',
 'duurzaamheid; duurzaam',
 'duurzaam; duurzaamheid',
 'nr; krijgt',
 'duurzaam; pensioenstelsel',
 'duurzaam; duurzaamheid',
 'duurzaam; duurzaamheid',
 'duurzaam; duurza

## Visualize Embeddings

In [91]:
cluster.scatter_plot()

ValueError: Requested 886 colors, function can only return colors up to the base palette's length (256)
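
The error indicates that the 886 requested colors, presumably one per cluster, exceed the 256 entries of the underlying palette. One possible workaround, shown here as a minimal sketch and not as the library's own fix, is to generate the colors by spacing hues evenly on the HSV color wheel. The function `distinct_colors` is a hypothetical helper; how such a list would be passed to `scatter_plot()` is not covered here.

```python
import colorsys


def distinct_colors(n: int) -> list[str]:
    """Return n hex color strings with evenly spaced hues (hypothetical helper)."""
    colors = []
    for i in range(n):
        # Spread the hues over the full color wheel; keep saturation and value fixed
        red, green, blue = colorsys.hsv_to_rgb(i / n, 0.65, 0.9)
        colors.append(
            "#{:02x}{:02x}{:02x}".format(
                round(red * 255), round(green * 255), round(blue * 255)
            )
        )
    return colors


palette = distinct_colors(886)
print(len(palette), palette[:3])
```

With this many colors, neighbouring hues are hardly distinguishable by eye. Reducing the number of clusters, for instance through a larger `min_cluster_size`, may be the more useful remedy.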

In [92]:
cluster.visualize()

### TODO: Refine sub-corpora

## Wizmap

In [40]:
from tempo_embeddings.visualization.wizmap import WizmapVisualizer

port = 8000

if "wizmap_visualizer" in locals():
    # Cleanup previous run
    wizmap_visualizer.cleanup()


wizmap_visualizer = WizmapVisualizer(
    corpus, title=FILTER_TERMS[0], stopwords=list(stopwords)
)

wizmap_visualizer.visualize(port=port)



Starting server on port 8000


In [41]:
if False:
    wizmap_visualizer.cleanup()

127.0.0.1 - - [08/Sep/2023 14:29:53] "GET /grid.json HTTP/1.1" 200 -
127.0.0.1 - - [08/Sep/2023 14:29:54] "GET /data.ndjson HTTP/1.1" 200 -
