# Preprocessing - classification of news articles

In this notebook we classify news articles. Doing this in plain PyTorch takes more manual preprocessing work, because torchtext is currently not supported in combination with the latest PyTorch release.
For that reason we continue here with Keras on a PyTorch backend.
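
Keras 3 picks its backend from the `KERAS_BACKEND` environment variable (or from `~/.keras/keras.json`). The snippet below is a minimal sketch, assuming Keras 3 with PyTorch installed; it is not part of the original notebook and has to run before the first `import keras`.

```python
import os

# Assumption: Keras 3. The backend is read once, when keras is first imported,
# so this line has to run before any `import keras` / `from keras ...` statement.
os.environ["KERAS_BACKEND"] = "torch"

print(os.environ["KERAS_BACKEND"])
```

Once Keras has been imported, `keras.backend.backend()` reports which backend is actually active.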

In [7]:
# Import necessary libraries
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, GlobalAveragePooling1D
from keras_preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split
import opendatasets as od

## Loading the data - news articles

We use the AG_NEWS dataset, which can be downloaded from Kaggle via this link: https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset

In [8]:
od.download("https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset")

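# Load the dataset
def read_csv(filename):
    df = pd.read_csv(filename)
    df.columns = ["label", "title", "description"]
    # Combine title and description into one text field for the tokenizer
    df["text"] = df["title"] + " " + df["description"]
    # Shift the labels from 1..4 to 0..3 so they are zero-based
    df["label"] = df["label"] - 1
    return df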

df_train = read_csv('./ag-news-classification-dataset/train.csv')
display(df_train.head())

df_test = read_csv('./ag-news-classification-dataset/test.csv')

Skipping, found downloaded files in "./ag-news-classification-dataset" (use force=True to force download)


|   | label | title | description | text |
|---|---|---|---|---|
| 0 | 2 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... | Wall St. Bears Claw Back Into the Black (Reute... |
| 1 | 2 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... | Carlyle Looks Toward Commercial Aerospace (Reu... |
| 2 | 2 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... | Oil and Economy Cloud Stocks' Outlook (Reuters... |
| 3 | 2 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... | Iraq Halts Oil Exports from Main Southern Pipe... |
| 4 | 2 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... | Oil prices soar to all-time record, posing new... |
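
To make the shifted labels readable, they can be mapped back to class names. This is a small sketch that assumes the class order documented for AG News (1 = World, 2 = Sports, 3 = Business, 4 = Sci/Tech); `CLASS_NAMES` and `label_to_name` are hypothetical helpers, not part of the notebook.

```python
# Assumption: class order as documented for AG News
# (1=World, 2=Sports, 3=Business, 4=Sci/Tech); read_csv shifts these to 0..3.
CLASS_NAMES = ["World", "Sports", "Business", "Sci/Tech"]

def label_to_name(label):
    """Map a zero-based label (as produced by read_csv) to its class name."""
    return CLASS_NAMES[label]

raw_labels = [3, 4, 1]                       # labels as stored in the CSV
shifted = [raw - 1 for raw in raw_labels]    # same shift as df['label'] - 1
names = [label_to_name(label) for label in shifted]
print(shifted, names)
```

Under this assumption, label 2 in the rows above corresponds to Business, which matches the Wall Street and oil price headlines.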


## Preprocessing - tokenizer

In [12]:
MAX_NUM_WORDS = 20000     # number of words in the vocabulary
MAX_SEQUENCE_LENGTH = 50  # maximum sequence length we use (pad or truncate if too short or too long)
EMBEDDING_DIM = 100       # number of embedding dimensions

def preprocess(df, tokenizer=None):
    if tokenizer is None:
        # training mode
        tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
        tokenizer.fit_on_texts(df['text']) # fit the tokenizer only in training mode

    sequences = tokenizer.texts_to_sequences(df['text']) # text to token indices
    X = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
    y = df['label'].to_numpy() # targets are the class labels, not the token sequences
    return X, y, tokenizer

X_train, y_train, tokenizer = preprocess(df_train)
print(X_train.shape)
X_test, y_test, tokenizer = preprocess(df_test, tokenizer) # pass the tokenizer so it is NOT fitted again

(120000, 50)
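
What the tokenizer and the padding do can be shown in plain Python. This is a simplified sketch, not the Keras implementation: the real `Tokenizer` also strips punctuation, and `fit_word_index`, `texts_to_sequences` and `pad_pre` are hypothetical helper names. It assumes the Keras defaults, where index 0 is reserved for padding, words with an index of `num_words` or higher are dropped, and both padding and truncation happen at the front of the sequence.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent, 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace words by indices; drop unknown words and indices >= num_words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_pre(seq, maxlen):
    """Left-pad with zeros, or keep only the last `maxlen` tokens."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

train_texts = ["oil prices soar", "oil exports halted", "stocks outlook"]
word_index = fit_word_index(train_texts)            # fitted on training text only
train_seqs = texts_to_sequences(train_texts, word_index, num_words=5)
padded = [pad_pre(seq, maxlen=2) for seq in train_seqs]

# Unseen words in new text are dropped, because the index is not refitted
test_seqs = texts_to_sequences(["new oil record"], word_index, num_words=5)
print(train_seqs, padded, test_seqs)
```

Because every row ends up with exactly `maxlen` entries, the result can be stacked into one array, which is why `X_train` has shape `(120000, 50)`.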


## Neural network with embedding

In [13]:
# Sequential model with Keras
model = Sequential()
model.add(Embedding(MAX_NUM_WORDS, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))

# forward pass
model(X_train).shape


TensorShape([120000, 50, 100])
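
The output shape follows from what an embedding layer does: it is a lookup table with one vector per token index. A plain-Python sketch of that lookup, with a tiny vocabulary and random values standing in for the trainable weights (`make_embedding_table` and `embed` are hypothetical helpers, not Keras functions):

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """One vector of `dim` small random values per token index."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]

def embed(batch, table):
    """Replace every token index by its vector: (batch, seq_len) -> (batch, seq_len, dim)."""
    return [[table[token] for token in seq] for seq in batch]

table = make_embedding_table(vocab_size=8, dim=4)
batch = [[0, 1, 4], [2, 3, 0]]     # 2 padded sequences of length 3
out = embed(batch, table)
shape = (len(out), len(out[0]), len(out[0][0]))
print(shape)
```

With 120000 sequences of length `MAX_SEQUENCE_LENGTH = 50` and `EMBEDDING_DIM = 100`, the same lookup gives the `[120000, 50, 100]` shape shown above.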