Elective Course Artificial Intelligence II: Lab

---

# 01 - Natural Language Processing (NLP)

## Importing and Analyzing the Data

The following uses the [imdb](https://huggingface.co/datasets/imdb) dataset of movie reviews to determine whether a given review is positive or negative.

In [1]:
from datasets import load_dataset

dataset = load_dataset("imdb")

In [2]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [3]:
dataset.set_format(type="pandas")
dataset["train"][:]

| | text | label |
|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |
| ... | ... | ... |
| 24995 | A hit at the time but now better categorised a... | 1 |
| 24996 | I love this movie like no other. Another time ... | 1 |
| 24997 | This film and it's sequel Barry Mckenzie holds... | 1 |
| 24998 | 'The Adventures Of Barry McKenzie' started lif... | 1 |


## Tokenizer and Model

In [4]:
from transformers import TFDistilBertForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased')
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-cased', num_labels=2)

2024-02-19 20:37:59.514925: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.

In [5]:
example = "I think I will make a movie next weekend. Oh wait, I'm working..oh I'm sure I can fit it in. It looks like whoever made this film fit it in. I hope the makers of this crap have day jobs because this film sucked!!! It looks like someones home movie and I don't think more than $100 was spent making it!!! Total crap!!! Who let's this stuff be released?!?!?!"
token_ids = tokenizer(example, return_tensors="tf")["input_ids"]
token_ids

<tf.Tensor: shape=(1, 103), dtype=int32, numpy=
array([[  101,   146,  1341,   146,  1209,  1294,   170,  2523,  1397,
         5138,   119,  2048,  3074,   117,   146,   112,   182,  1684,
          119,   119,  9294,   146,   112,   182,  1612,   146,  1169,
         4218,  1122,  1107,   119,  1135,  2736,  1176, 11830,  1189,
         1142,  1273,  4218,  1122,  1107,   119,   146,  2810,  1103,
        12525,  1104,  1142, 11074,  1138,  1285,  5448,  1272,  1142,
         1273,  8286,   106,   106,   106,  1135,  2736,  1176,  1800,
         1116,  1313,  2523,  1105,   146,  1274,   112,   189,  1341,
         1167,  1190,   109,  1620,  1108,  2097,  1543,  1122,   106,
          106,   106,  8653, 11074,   106,   106,   106,  2627,  1519,
          112,   188,  1142,  4333,  1129,  1308,   136,   106,   136,
          106,   136,   106,   102]], dtype=int32)>

In [6]:
tokenizer.convert_ids_to_tokens(token_ids[0])

['[CLS]',
 'I',
 'think',
 'I',
 'will',
 'make',
 'a',
 'movie',
 'next',
 'weekend',
 '.',
 'Oh',
 'wait',
 ',',
 'I',
 "'",
 'm',
 'working',
 '.',
 '.',
 'oh',
 'I',
 "'",
 'm',
 'sure',
 'I',
 'can',
 'fit',
 'it',
 'in',
 '.',
 'It',
 'looks',
 'like',
 'whoever',
 'made',
 'this',
 'film',
 'fit',
 'it',
 'in',
 '.',
 'I',
 'hope',
 'the',
 'makers',
 'of',
 'this',
 'crap',
 'have',
 'day',
 'jobs',
 'because',
 'this',
 'film',
 'sucked',
 '!',
 '!',
 '!',
 'It',
 'looks',
 'like',
 'someone',
 '##s',
 'home',
 'movie',
 'and',
 'I',
 'don',
 "'",
 't',
 'think',
 'more',
 'than',
 '$',
 '100',
 'was',
 'spent',
 'making',
 'it',
 '!',
 '!',
 '!',
 'Total',
 'crap',
 '!',
 '!',
 '!',
 'Who',
 'let',
 "'",
 's',
 'this',
 'stuff',
 'be',
 'released',
 '?',
 '!',
 '?',
 '!',
 '?',
 '!',
 '[SEP]']
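
The `##` prefix in the token list above marks a WordPiece continuation piece: the tokenizer evidently has no single entry for `someones`, so it emits `someone` followed by `##s`. As a minimal sketch, the pieces can be glued back into words as follows. The helper name `merge_wordpieces` is hypothetical and not part of `transformers`.

```python
def merge_wordpieces(tokens):
    """Join WordPiece continuation pieces ('##...') onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: strip the marker and append to the last word.
            words[-1] += token[2:]
        else:
            words.append(token)
    return words


pieces = ["It", "looks", "like", "someone", "##s", "home", "movie"]
print(merge_wordpieces(pieces))
```

Punctuation and the special tokens `[CLS]` and `[SEP]` are left untouched by this helper; they would need separate handling to recover the original string.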

In [10]:
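from tensorflow.data import Dataset

# Tokenize a small subset; slicing takes the first rows in stored order, not a random sample.
# truncation=True cuts long reviews to the model's maximum length, padding=True pads to the longest one.
train_encodings = tokenizer(list(dataset["train"]["text"][:1000]), truncation=True, padding=True)
test_encodings = tokenizer(list(dataset["test"]["text"][:100]), truncation=True, padding=True)

# Pair the tokenizer output (input_ids, attention_mask) with the labels.
train = Dataset.from_tensor_slices((dict(train_encodings), dataset["train"]["label"][:1000]))
test = Dataset.from_tensor_slices((dict(test_encodings), dataset["test"]["label"][:100]))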
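
The cell above relies on `truncation=True` and `padding=True` so that all reviews end up with the same length and can be stacked into one tensor. The following is a simplified sketch of that idea in plain Python, not the tokenizer's actual implementation. The helper `pad_and_truncate`, the padding ID `0`, and the maximum length of 512 are assumptions for illustration, and unlike the real tokenizer this version does not preserve the closing `[SEP]` when it truncates.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Cut each ID list to max_length, then right-pad to the longest remaining list."""
    truncated = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # 1 marks real tokens, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 146, 1341, 102], [101, 146, 102]]
encoded = pad_and_truncate(batch, max_length=512)
print(encoded["input_ids"])       # [[101, 146, 1341, 102], [101, 146, 102, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The attention mask is what lets the model tell padded positions apart from real tokens, which is why the whole tokenizer output is passed on as a dictionary rather than only the `input_ids`.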

### Modeling

In [12]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy

optimizer = Adam(learning_rate=5e-5)
# The model returns raw logits, hence from_logits=True.
loss_fn = SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss_fn, metrics=['accuracy'])
# The dataset is already batched, so fit() gets no separate batch_size.
model.fit(train.shuffle(1000).batch(16), epochs=1)

2024-02-19 20:39:23.036755: I external/local_tsl/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2024-02-19 20:39:24.221265: I external/local_xla/xla/service/service.cc:168] XLA service 0x7f61608b0080 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-02-19 20:39:24.221294: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 4070 Ti, Compute Capability 8.9
2024-02-19 20:39:24.224511: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-02-19 20:39:24.234761: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8904
I0000 00:00:1708371564.286710    3091 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.




<keras.src.callbacks.History at 0x7f615cd4d290>
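
The training cell uses `SparseCategoricalCrossentropy(from_logits=True)`, meaning the labels are plain integers (0 or 1) and the model outputs unnormalized scores rather than probabilities. As a rough sketch of what this loss computes for a single example, here is a plain-Python version; the function name is made up for illustration, and Keras additionally averages the result over the batch.

```python
import math


def sparse_categorical_crossentropy_from_logits(logits, label):
    """Cross-entropy of one example: -log(softmax(logits)[label])."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label]


# Undecided model: both classes equally likely, loss is ln(2).
print(sparse_categorical_crossentropy_from_logits([0.0, 0.0], label=0))
# Confident and correct versus confident and wrong.
print(sparse_categorical_crossentropy_from_logits([4.0, -4.0], label=0))
print(sparse_categorical_crossentropy_from_logits([4.0, -4.0], label=1))
```

The loss is small when the logit of the true class dominates and grows roughly linearly with the margin when the model is confidently wrong.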

### Evaluation

In [14]:
loss, acc = model.evaluate(test.batch(16))
acc



1.0
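
An accuracy of exactly 1.0 after a single epoch on 1000 reviews deserves a sanity check. The preview of the training split shows label 0 in the first rows and label 1 in the last rows, which suggests the data is stored sorted by label. If that holds, the slices `[:1000]` and `[:100]` contain only one class, and a model that always predicts that class reaches 1.0. The sketch below uses synthetic labels as a stand-in for the real column (the 12500/12500 layout is an assumption) to show how to check the class balance of a slice and how a random subset differs.

```python
import random
from collections import Counter

# Synthetic stand-in for a label column sorted by class (assumed layout):
# 12500 negatives followed by 12500 positives.
labels = [0] * 12500 + [1] * 12500

# Class balance of the first 1000 rows, as taken by a plain slice.
head_counts = Counter(labels[:1000])
print(head_counts)  # Counter({0: 1000})

# Class balance of 1000 rows drawn at random with a fixed seed.
rng = random.Random(42)
sample_indices = rng.sample(range(len(labels)), k=1000)
sample_counts = Counter(labels[i] for i in sample_indices)
print(sample_counts)
```

On the real data, the same check would be `Counter(dataset["train"]["label"][:1000])`. If the slice turns out to be single-class, one option is to draw the subset with `dataset["train"].shuffle(seed=...)` followed by `.select(range(1000))` before tokenizing, and to re-run training and evaluation on that subset.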
