
# YLE news data

* We'll be using data from YLE news for many of the demos
* The data was gathered from the 2021 YLE RSS feed
* It's available here: http://dl.turkunlp.org/TKO_8964_2023/

# Task 1

* Grab the data with `wget` and give it a look
* Think about what kinds of NLP tasks you could use the data for

In [1]:
!wget http://dl.turkunlp.org/TKO_8964_2023/news-fi-2021.jsonl

--2023-01-15 12:30:20--  http://dl.turkunlp.org/TKO_8964_2023/news-fi-2021.jsonl
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 36139303 (34M) [application/octet-stream]
Saving to: ‘news-fi-2021.jsonl’


2023-01-15 12:30:21 (32.2 MB/s) - ‘news-fi-2021.jsonl’ saved [36139303/36139303]



The data can be used for, e.g.:


*   Retrieval (as we did)
*   Classification / topic labeling (there seem to be topic labels)
*   Summarization
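
Before deciding on a task, it helps to look at one raw record. The following is a minimal sketch, assuming the file is in JSON Lines format (one JSON object per line, as the `.jsonl` extension suggests). The helper name `peek_jsonl` is hypothetical and not part of the course material.

```python
import json
from pathlib import Path


def peek_jsonl(path, n=1):
    """Return the first n records of a JSON Lines file as Python dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            records.append(json.loads(line))
            if len(records) >= n:
                break
    return records


# Only runs if the file from the wget cell is in the working directory.
data_path = Path("news-fi-2021.jsonl")
if data_path.exists():
    for record in peek_jsonl(data_path, n=1):
        for key, value in record.items():
            # Truncate long values so the output stays readable.
            print(f"{key}: {str(value)[:80]}")
```

Reading only the first few lines avoids loading the whole 34M file just to see which fields it has.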



# Task 2

* Load the data as a HuggingFace dataset
* Remember to pip-install `datasets` first
* Documentation: https://huggingface.co/docs/datasets/loading#json

In [2]:
!pip3 install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.8.0-py3-none-any.whl (452 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (213 kB)

In [3]:
import datasets
dset=datasets.load_dataset("json", data_files="news-fi-2021.jsonl")



Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-ebbaba727a3fbc92/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...



Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-ebbaba727a3fbc92/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.



In [4]:
for item in dset["train"]:
    print(item.keys())
    break

dict_keys(['summary', 'tags', 'text', 'timestamp', 'title', 'url'])


# Task 3

- Try some keyword search on the data
- Run the data through the trusty CountVectorizer
- Do a keyword search, e.g. look for "Turku" AND "silta", and print the titles of the matching news items (run the search on the `text` field, though)
- Some hints:
  - if you just copy the code from the lecture, you will get an out-of-memory error
  - can you think of why? What is the data structure returned by the vectorizer?
  - the vocabulary of the vectorizer is in `cv.vocabulary_`
  - you may still need `.todense()`, but maybe in a different spot
  - you may need `.nonzero()` to gather the matching documents
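
The out-of-memory hint can be made concrete with some back-of-the-envelope arithmetic. The sketch below compares the size of a dense document-term matrix with a rough estimate for a sparse one. The document count, vocabulary size, and average number of distinct words per document are hypothetical values chosen for illustration, not measurements of the YLE data. The byte sizes assume 8-byte counts and 4-byte column indices.

```python
def dense_matrix_bytes(n_docs, vocab_size, bytes_per_cell=8):
    """Bytes needed to store every cell of an n_docs x vocab_size matrix."""
    return n_docs * vocab_size * bytes_per_cell


def sparse_matrix_bytes(n_nonzero, bytes_per_value=8, bytes_per_index=4):
    """Rough size of a sparse matrix: one value and one column index
    per nonzero cell (row pointers are ignored)."""
    return n_nonzero * (bytes_per_value + bytes_per_index)


# Hypothetical sizes, for illustration only (not measured on the YLE data).
n_docs = 10_000
vocab_size = 500_000
avg_unique_words_per_doc = 200

dense = dense_matrix_bytes(n_docs, vocab_size)
sparse = sparse_matrix_bytes(n_docs * avg_unique_words_per_doc)

print(f"dense:  {dense / 1e9:.1f} GB")   # dense:  40.0 GB
print(f"sparse: {sparse / 1e6:.1f} MB")  # sparse: 24.0 MB
```

Under these assumptions the dense matrix is over a thousand times larger than the sparse one, because almost every cell is zero.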
  


In [5]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(lowercase=False,binary=True)
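# Keep the result sparse: calling .todense() on the full matrix would exceed the memory limit.
# The transpose (.T) gives a term-document matrix with one row per vocabulary item.
td_matrix=cv.fit_transform(item["text"] for item in dset["train"]).T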

In [6]:
u=td_matrix[cv.vocabulary_["Turku"]]
m=td_matrix[cv.vocabulary_["silta"]]

In [7]:
# Boolean & does not work on the sparse representation, so make the two rows dense
# one at a time; this is cheap compared to making the whole matrix dense.
result=u.todense() & m.todense()
result.nonzero()

(array([0, 0, 0]), array([1651, 1660, 3997]))

In [8]:
for j in result.nonzero()[1]:
    print(dset["train"][int(j)]["title"])

Tunteita herättänyt ja ilkivallan kohteeksikin joutunut Turun Teatterisillan sateenkaaritähti poistetaan perjantaina
Yli 21 miljoonaa maksanut Logomon silta hämmentää Turussa: Miten sinne pääsee, miksi silta loppuu kesken? Saimme vastauksia kysymyksiin
Vuosien työ tulee päätökseen – Turun Logomon silta sai viimein avajaispäivän
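
Another way to view the AND query above is as an intersection of posting lists: for each term, the set of documents that contain it. The sketch below shows the idea in plain Python on a few invented toy sentences. It splits on whitespace, which is a simplification compared to the tokenization done by `CountVectorizer`, and the function names `build_postings` and `search_and` are hypothetical.

```python
from collections import defaultdict


def build_postings(texts):
    """Map each whitespace-separated token to the set of document indices containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(texts):
        for token in text.split():
            postings[token].add(doc_id)
    return postings


def search_and(postings, *terms):
    """Return the sorted indices of documents that contain every term."""
    sets = [postings.get(term, set()) for term in terms]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


# Toy documents, invented for illustration only.
docs = [
    "Turku ja uusi silta",
    "Helsinki ja vanha silta",
    "Turku juhlii kesää",
    "Uusi silta on valmis ja Turku iloitsee",
]
postings = build_postings(docs)
print(search_and(postings, "Turku", "silta"))  # [0, 3]
```

As in the vectorizer-based search, matching is exact and case-sensitive, so inflected forms such as "Turun" or "sillan" are not found by a query for "Turku" AND "silta".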


# Task 4

* This is for those who run a little ahead
* Run the dataset through a HuggingFace pipeline as follows:
  * Model: `xlm-roberta-base` (or any other similar model of your choice)
  * Task: `feature-extraction`
  * Relevant documentation: https://huggingface.co/docs/transformers/v4.24.0/en/main_classes/pipelines#pipeline-batching
  * You will likely want to use the GPU
  * Make sure you understand what the return values are


In [9]:
!pip3 install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, transformers
Successfully installed tokenizers-0.13.2 transformers-4.25.1


In [10]:
import transformers


In [17]:
p=transformers.pipeline(task="feature-extraction",model="xlm-roberta-base",return_tensors=True, device=0)
for x in p(transformers.pipelines.pt_utils.KeyDataset(dset["train"], "text"), batch_size=8, truncation="only_first"):
    print(x.shape)
    break


Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


torch.Size([1, 512, 768])


What we get back is an iterable that yields one embedding tensor per item. Here the tensor has shape 1x512x768: 1 is the batch dimension, 512 is typically the maximum sequence length of the model, and 768 is the embedding dimensionality. So this is one embedding for each of the first 512 tokens of the text.
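
If a single vector per document is needed, one common option is to average the token embeddings (mean pooling). This is an assumption for illustration, not something the exercise prescribes. The sketch below uses plain Python lists and a toy input of shape 1x4x3 instead of 1x512x768, and the helper name `mean_pool` is hypothetical. With the real PyTorch tensor, `x[0].mean(dim=0)` would compute the same average.

```python
def mean_pool(token_embeddings):
    """Average a list of token vectors (seq_len x hidden) into one vector of length hidden."""
    seq_len = len(token_embeddings)
    hidden = len(token_embeddings[0])
    return [
        sum(token[d] for token in token_embeddings) / seq_len
        for d in range(hidden)
    ]


# Toy stand-in for one pipeline output: shape (1, 4, 3) instead of (1, 512, 768).
x_toy = [[
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 2.0],
    [5.0, 0.0, 2.0],
    [7.0, 4.0, 2.0],
]]

doc_vector = mean_pool(x_toy[0])  # drop the leading batch dimension first
print(doc_vector)  # [4.0, 1.0, 2.0]
```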

In [18]:
print(x.device)

cpu
