# Data

You are welcome to bring your own data and use it with this repository.
If you want to do so, you should ideally place it as a `json`-file in the `data` folder.
The `json`-file should contain a list of dictionaries, where each dictionary represents a single data point.

```json
[
    {
        "text": "This is the first text.",
        "label": 0
    },
    {
        "text": "This is the second text.",
        "label": 1
    }
]
```

If you don't have any labels for your data, just set the `label` to `-1`.
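
Before using your own file, it can help to check that it matches this layout. The following is a minimal sketch, not part of this repository: `load_records` is a hypothetical helper that reads a JSON file, verifies that it holds a list of dictionaries with a `text` key, and fills in `-1` wherever `label` is missing.

```python
import json


def load_records(path):
    """Load a JSON file and check it matches the expected list-of-dicts layout.

    Hypothetical helper for illustration; not part of this repository.
    """
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)

    if not isinstance(records, list):
        raise ValueError("Expected a JSON list of data points.")

    for i, record in enumerate(records):
        if not isinstance(record, dict) or "text" not in record:
            raise ValueError(
                f"Data point {i} must be a dictionary with a 'text' key.")
        # Unlabelled data points get the placeholder label -1.
        record.setdefault("label", -1)

    return records
```

Assuming your file is called `data/my_data.json` (an example name only), you would call `records = load_records("data/my_data.json")`.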

If you don't have data of your own, or would rather not use it, you can follow along with this notebook to download two sample datasets.

## Sample Datasets

This notebook is meant to be run first and only once. It downloads the datasets used in this project and saves them in the `data` folder.
Feel free to skip this notebook and download the datasets manually, or copy your own datasets to the `data` folder.

The two datasets used in this course are open source and hosted by [Hugging Face](https://huggingface.co/datasets).
We will refer to the first one as the `articles` dataset and the second one as the `headlines` dataset.
The `articles` dataset contains $\sim 3'500$ news articles from 8 different categories. The `headlines` dataset contains over $100'000$ headlines, divided into four categories. To make this second dataset more tractable, we will only use a random sample of $10'000$ headlines.

By default, this notebook will operate on the `articles` dataset.

In [3]:
from datasets import load_dataset
from os import makedirs
from os.path import join
import json
import random

makedirs('headlines', exist_ok=True)
makedirs('articles', exist_ok=True)


### Download the News Headlines

In [6]:
dataset = load_dataset("ag_news")
dataset = list(dataset['train'])

random.seed(42)

# We only want to use a random sample of 10'000 headlines
# You can comment out this line if you want to use the full dataset
dataset = random.sample(dataset, 10000)

# Sample 80% of the dataset for training
dataset_split = set(random.sample(
    range(len(dataset)), int(0.8 * len(dataset))))
dataset_train = [dataset[i] for i in dataset_split]
dataset_test = [dataset[i]
                for i in range(len(dataset)) if i not in dataset_split]

with open(join('headlines', 'train.json'), 'w') as f:
    json.dump(dataset_train, f, ensure_ascii=False, indent=4)

with open(join('headlines', 'test.json'), 'w') as f:
    json.dump(dataset_test, f, ensure_ascii=False, indent=4)



### Download the News Articles

In [6]:
dataset = load_dataset("valurank/News_Articles_Categorization")
dataset = [{'text': d['Text'], 'label': d['Category']}
           for d in dataset['train']]

# Sample 80% of the dataset for training
random.seed(42)
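dataset_split = random.sample(range(len(dataset)), int(0.8 * len(dataset)))
# Use a set for the membership test below; a list lookup is O(n) per check
train_indices = set(dataset_split)
dataset_train = [dataset[i] for i in dataset_split]
dataset_test = [dataset[i]
                for i in range(len(dataset)) if i not in train_indices]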

with open(join('articles', 'train.json'), 'w') as f:
    json.dump(dataset_train, f, ensure_ascii=False, indent=4)

with open(join('articles', 'test.json'), 'w') as f:
    json.dump(dataset_test, f, ensure_ascii=False, indent=4)


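
Both download cells repeat the same 80/20 split logic. As a sketch of how it could be factored out, the hypothetical helper `split_records` below (not part of this repository) wraps the same idea in a function. It uses its own `random.Random` generator rather than the global seed, so the split it produces need not match the files written above.

```python
import random


def split_records(records, train_fraction=0.8, seed=42):
    """Split a list of records into a train list and a test list.

    Hypothetical helper; it mirrors the sampling logic of the download cells.
    """
    rng = random.Random(seed)  # local generator, global random state untouched
    n_train = int(train_fraction * len(records))
    train_indices = rng.sample(range(len(records)), n_train)
    train_set = set(train_indices)  # set gives O(1) membership checks
    train = [records[i] for i in train_indices]
    test = [records[i] for i in range(len(records)) if i not in train_set]
    return train, test
```

With this helper, each cell would reduce to `dataset_train, dataset_test = split_records(dataset)` followed by the two `json.dump` calls.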

## Exploring the Data

It is always worth glancing at your data, using the loading method you intend to use later, before you start working with it.
This lets you check that everything is set up correctly, both for retrieving and for loading the data.

For this course, all datasets are saved as `.json` files, each containing an array of objects with `text` and `label` keys.

In [17]:
with open(join('articles', 'test.json'), 'r') as f:
    dataset = json.load(f)


Let's start by extracting all the labels from the `articles` dataset.

In [18]:
labels = {d['label'] for d in dataset}
print(labels)


{'Entertainment', 'Business', 'Sports', 'Politics', 'Tech', 'Health', 'World', 'science'}
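
The set of labels shows which categories occur, but not how often. As a minimal sketch, the hypothetical helper `count_labels` below counts data points per label with `collections.Counter`. The `sample` list is a small stand-in so that the example runs on its own.

```python
from collections import Counter


def count_labels(records):
    """Return a Counter mapping each label to its number of data points."""
    return Counter(record["label"] for record in records)


# Tiny stand-in for the loaded dataset, so the example is self-contained.
sample = [
    {"text": "a", "label": "Tech"},
    {"text": "b", "label": "Tech"},
    {"text": "c", "label": "science"},
]

# Print labels from most to least frequent.
for label, count in count_labels(sample).most_common():
    print(f"{label}: {count}")
```

On the real data you would call `count_labels(dataset)` instead. A strongly skewed distribution is worth knowing about before training a classifier.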


Let's also have a look at the beginning of the first eight articles.

In [16]:
for d in dataset[:8]:
    print("From {}:".format(d['label']))
    print(" " + d['text'][:100] + "...")
    print()


From Entertainment:
 Elon Musk, Amber Heard Something's Fishy On Wrapped-Up Sushi Last we heard, Elon Musk and Amber Hear...

From Sports:
 Before Coming Out, a Hard Time Growing UpVideoMichael Sam, a defensive end at Missouri who will ente...

From Tech:
 Fortnites parent company, Epic Games, had broken its contract with Apple, a federal judge found. The...

From World:
 The InterpreterCredit...Jean-Paul Pelissier/ReutersMarch 6, 2017BERLIN An idea, once unthinkable, is...

From Health:
 Credit...Craig Sherod/AcelRx Pharmaceuticals, via Associated PressNov. 2, 2018WASHINGTON The Food an...

From science:
 While the unknowns about coronavirus abound, a new study finds we can handle the truth.Credit...Step...

From Tech:
 The New New WorldThe Chinese telecom giant seeks acceptance in the West, but its structure and value...

From Business:
 Special Report: Energy for TomorrowDec. 7, 2015PARIS The weeks leading up to the United Nations glob...



Lastly, we take the fourth article (index 3, the `World` article above) and have a look at its content, printed ten words per line.

In [31]:
words = dataset[3]['text'].split(" ")
for s in range(0, len(words), 10):
    print(" ".join(words[s:s+10]))


The InterpreterCredit...Jean-Paul Pelissier/ReutersMarch 6, 2017BERLIN An idea, once unthinkable, is
gaining attention in European policy circles: a European Union nuclear
weapons program.Under such a plan, Frances arsenal would be repurposed
to protect the rest of Europe and would be put
under a common European command, funding plan, defense doctrine, or
some combination of the three. It would be enacted only
if the Continent could no longer count on American protection.Though
no new countries would join the nuclear club under this
scheme, it would amount to an unprecedented escalation in Europes
collective military power and a drastic break with American leadership.Analysts
say that the talk, even if it never translates into
action, demonstrates the growing sense in Europe that drastic steps
may be necessary to protect the postwar order in the
era of a Trump presidency, a resurgent Russia and the
possibility of an alignment between the two.Even proponents, who remain
a minority, ackn
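
The loop above breaks lines after a fixed number of words, so line lengths vary. As an alternative sketch, the standard library's `textwrap.wrap` breaks at a fixed character width instead. The `preview` helper and the sample string are hypothetical stand-ins, not part of this repository.

```python
import textwrap


def preview(text, width=80, max_lines=5):
    """Wrap text to a fixed character width and return the first few lines."""
    lines = textwrap.wrap(text, width=width)
    return "\n".join(lines[:max_lines])


# Stand-in text; with the real data you could pass dataset[3]['text'] instead.
sample_text = "An idea, once unthinkable, is gaining attention. " * 10
print(preview(sample_text, width=40, max_lines=3))
```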