# 📂 **Project: Automatic Argumentation Analysis - Naive Approach**

In this notebook, we implement a **basic argument mining system** from scratch using the [PyTorch](https://pytorch.org) library. This **naive approach** serves as an initial step in exploring **argumentation analysis** by focusing on simplified techniques, which are essential for understanding the core components of the task.

The implementation uses **simple modeling approaches** that closely mirror functionality found in **higher-level libraries** such as [Hugging Face](https://huggingface.co). This makes the project a good starting point for learning the fundamental building blocks of **argument mining**: **tokenizers**, **datasets**, and **basic models**. It provides a foundation for moving on to more sophisticated methods, such as **Large Language Models (LLMs)**.

This naive implementation also lays the groundwork for the **enhanced approach** that we will explore later, where we replace these basic techniques with **state-of-the-art LLMs** to improve accuracy and scalability.

Authors: **Nassim Lattab** and **Mohamed Azzaoui**.

---



### Preparing the Environment and Downloading Data


In [None]:
# %cd /content
# !ls
# !rm -rf /content/*

# !git clone https://github.com/bencrabbe/argumentation_base.git
# %cd argumentation_base

# !git pull origin main

# %cd data
# !sh download_data.sh

# %cd ..

/content
argumentation_base
Cloning into 'argumentation_base'...
remote: Enumerating objects: 77, done.
remote: Counting objects: 100% (77/77), done.
remote: Compressing objects: 100% (53/53), done.
remote: Total 77 (delta 35), reused 61 (delta 22), pack-reused 0 (from 0)
Receiving objects: 100% (77/77), 18.50 KiB | 1.32 MiB/s, done.
Resolving deltas: 100% (35/35), done.
/content/argumentation_base
From https://github.com/bencrabbe/argumentation_base
 * branch            main       -> FETCH_HEAD
Already up to date.
/content/argumentation_base/data
--2024-12-25 19:42:59--  https://tudatalib.ulb.tu-darmstadt.de/bitstream/handle/tudatalib/2422/ArgumentAnnotatedEssays-2.0.zip
Resolving tudatalib.ulb.tu-darmstadt.de (tudatalib.ulb.tu-darmstadt.de)... 130.83.152.157
Connecting to tudatalib.ulb.tu-darmstadt.de (tudatalib.ulb.tu-darmstadt.de)|130.83.152.157|:443... connected.
HTTP request sent, awaiting response... 200 200
Length: unspecified [application/zip]
Saving to: ‘ArgumentA

In [None]:
!python view_data.py abstrct_neoplasm_train.json

The output stream was truncated and contains only the last 5000 lines.
49	assigned	O
50	1077	O
51	patients	O
52	to	O
53	receive	O
54	docetaxel	O
55	at	O
56	75	O
57	mg	O
58	/	O
59	m2	O
60	of	O
61	body	O
62	surface	O
63	area	O
64	(	O
65	1	O
66	-	O
67	hour	O
68	intravenous	O
69	infusion	O
70	)	O
71	or	O
72	paclitaxel	O
73	at	O
74	175	O
75	mg	O
76	/	O
77	m2	O
78	(	O
79	3	O
80	-	O
81	hour	O
82	intravenous	O
83	infusion	O
84	)	O
85	.	O
86	Both	O
87	treatments	O
88	then	O
89	were	O
90	followed	O
91	by	O
92	carboplatin	O
93	to	O
94	an	O
95	area	O
96	under	O
97	the	O
98	plasma	O
99	concentration-time	O
100	curve	O
101	of	O
102	5	O
103	.	O
104	The	O
105	treatments	O
106	were	O
107	repeated	O
108	every	O
109	3	O
110	weeks	O
111	for	O
112	six	O
113	cycles	O
114	;	O
115	in	O
116	responding	O
117	patients	O
118	,	O
119	an	O
120	additional	O
121	three	O
122	cycles	O
123	of	O
124	single-agent	O
125	carboplatin	O
126	was	O
127	permitted	O
128	.	O
129	Survival	O
130	curves	O
131	wer

### Building Vocabulary and BIO Labels
This cell extracts the vocabulary (all unique words) and BIO labels (annotations) from the training data. The tokens `<unk>` and `<pad>` are added to handle unknown words and padding.


In [1]:
def build_vocab_and_labels(data):
    vocab = set()
    bio_labels = set()

    for doc in data:
        for paragraph in doc['tokens']:
            for token in paragraph:
                vocab.add(token['str'])
                bio_labels.add(token['arg'])

    # add unknown and padding tokens
    vocab = ["<unk>", "<pad>"] + sorted(vocab)
    bio_labels = sorted(bio_labels)

    return vocab, bio_labels
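
As an illustration, the snippet below applies the same logic to a single hand-made document. The toy tokens and the `idx`/`str`/`arg` field layout are assumptions modeled on what the function above reads; they are not an excerpt from the dataset.

```python
# Toy document in the assumed layout: doc["tokens"] is a list of paragraphs,
# each paragraph a list of token dicts with "str" (surface form) and "arg" (BIO tag).
toy_data = [{
    "tokens": [[
        {"idx": 0, "str": "Smoking", "arg": "B-Claim"},
        {"idx": 1, "str": "is", "arg": "I-Claim"},
        {"idx": 2, "str": "harmful", "arg": "I-Claim"},
        {"idx": 3, "str": ".", "arg": "O"},
    ]]
}]

tokens = [tok for doc in toy_data for par in doc["tokens"] for tok in par]

# Special tokens come first, so "<unk>" gets index 0 and "<pad>" gets index 1.
vocab = ["<unk>", "<pad>"] + sorted({tok["str"] for tok in tokens})
bio_labels = sorted({tok["arg"] for tok in tokens})

print(vocab)       # ['<unk>', '<pad>', '.', 'Smoking', 'harmful', 'is']
print(bio_labels)  # ['B-Claim', 'I-Claim', 'O']
```

Sorting makes the index assignment deterministic across runs, which matters because a saved model is only usable with the same word-to-index mapping.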

### Naive Tokenizer
This `NaiveTokenizer` class is a basic implementation designed for preparing input data for training and evaluation. It converts words and BIO labels into numerical indices for the model, handles padding, and decodes outputs, ensuring smooth interaction with the neural network.

In [2]:
import torch

class NaiveTokenizer:
    def __init__(self, vocab, bio_labels):
        self.word2idx = {word: idx for idx, word in enumerate(vocab)}
        self.idx2word = {idx: word for word, idx in self.word2idx.items()}

        self.label2idx = {label: idx for idx, label in enumerate(bio_labels)}
        self.idx2label = {idx: label for label, idx in self.label2idx.items()}

        self.pad_token = self.word2idx["<pad>"]

    def encode(self, tokens):
        return [self.word2idx.get(token, self.word2idx["<unk>"]) for token in tokens]

    def decode(self, indices):
        return [self.idx2word[idx] for idx in indices]

    def encode_bio_labels(self, labels):
        return [self.label2idx[label] for label in labels]

    def decode_bio_labels(self, indices):
        return [self.idx2label[idx] for idx in indices]

    def pad_batch(self, batch, pad_token):
        max_len = max(len(seq) for seq in batch)
        return torch.tensor([seq + [pad_token] * (max_len - len(seq)) for seq in batch])

    def collate_fn(self, batch):
        """
        Custom collate function for DataLoader.
        Args:
          batch: List of samples from the dataset.
        Returns:
          input_ids_padded: Padded tensor of input IDs.
          bio_labels_padded: Padded tensor of BIO labels.
        """
        input_ids = [sample["input_ids"] for sample in batch]
        bio_labels = [sample["bio_labels"] for sample in batch]

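        input_ids_padded = self.pad_batch(input_ids, self.pad_token)
        # NOTE: labels are padded with the *word* pad index, which is also a valid
        # label index. The loss must ignore this value (see `ignore_index`), and
        # real tokens carrying that label index are then ignored as well.
        bio_labels_padded = self.pad_batch(bio_labels, self.pad_token)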

        return input_ids_padded, bio_labels_padded
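
The following is a minimal, pure-Python sketch of the encoding and padding steps, using a hypothetical toy vocabulary. It also shows that the `<pad>` word index coincides with a real label index, and sketches one possible alternative (an assumption, not what the notebook does): padding labels with a dedicated value such as `-100`.

```python
# Hypothetical toy vocabulary and label inventory.
vocab = ["<unk>", "<pad>", ".", "Smoking", "harmful", "is"]
bio_labels = ["B-Claim", "I-Claim", "O"]
word2idx = {w: i for i, w in enumerate(vocab)}
label2idx = {l: i for i, l in enumerate(bio_labels)}

def encode(tokens):
    # Words missing from the vocabulary fall back to the "<unk>" index.
    return [word2idx.get(t, word2idx["<unk>"]) for t in tokens]

def pad(seqs, pad_value):
    # Right-pad every sequence to the length of the longest one.
    max_len = max(len(s) for s in seqs)
    return [s + [pad_value] * (max_len - len(s)) for s in seqs]

batch = [encode(["Smoking", "is", "harmful", "."]), encode(["Vaping", "."])]
padded_inputs = pad(batch, word2idx["<pad>"])
print(padded_inputs)  # [[3, 5, 4, 2], [0, 2, 1, 1]]

# The word pad index (1) is also the index of a real label ("I-Claim" here).
print(bio_labels[word2idx["<pad>"]])  # I-Claim

# Assumption: a dedicated label pad value avoids that overlap; -100 is the
# default `ignore_index` of torch.nn.CrossEntropyLoss.
LABEL_PAD = -100
labels = [[label2idx[l] for l in ["B-Claim", "I-Claim", "I-Claim", "O"]],
          [label2idx[l] for l in ["O", "O"]]]
padded_labels = pad(labels, LABEL_PAD)
print(padded_labels)  # [[0, 1, 1, 2], [2, 2, -100, -100]]
```

Adopting a separate label pad value would also require the loss to be configured with the same `ignore_index`.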

### Dataset Construction
This `ArgumentationDataset` class processes the raw input data and prepares it for the model. Each document is parsed into paragraphs, and tokens within these paragraphs are encoded into numerical indices using the `NaiveTokenizer`. The resulting dataset provides both input IDs (`input_ids`) and their corresponding BIO labels (`bio_labels`), ready for batching and training.

In [3]:
from torch.utils.data import Dataset

class ArgumentationDataset(Dataset):
    def __init__(self, data, tokenizer):
        """
        Converts raw argumentation data into a PyTorch-compatible dataset.
        Each data sample contains tokenized input IDs and their corresponding BIO labels,
        making it suitable for sequence labeling tasks.
        """
        self.data = []
        for doc in data:
            for paragraph in doc['tokens']:
                token_ids = tokenizer.encode([token['str'] for token in paragraph])
                bio_labels = tokenizer.encode_bio_labels([token['arg'] for token in paragraph])

                self.data.append({
                    "input_ids": token_ids,
                    "bio_labels": bio_labels
                })

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

### NaiveBIOModel: BIO Tagging for Argument Mining
This model predicts BIO tags (B-Claim, I-Premise, O, etc.) for tokens in a sequence. It uses token embeddings, an LSTM layer for context, and a linear layer to output predictions.

In [4]:
import torch.nn as nn

class NaiveBIOModel(nn.Module):
    def __init__(self, emb_size, vocab_size, bio_size, pad_id, num_layers=1):
        """
        Model for predicting BIO annotations.
        Args:
          emb_size (int): Size of the embeddings.
          vocab_size (int): Vocabulary size.
          bio_size (int): Number of BIO classes.
          pad_id (int): ID of the padding token.
          num_layers (int): Number of LSTM layers.
        """
        super().__init__()
        self.pad_id = pad_id

        # Token embeddings
        self.word_embedding = nn.Embedding(vocab_size, emb_size, padding_idx=pad_id)

        # LSTM for sequential context
        self.lstm = nn.LSTM(emb_size, emb_size, batch_first=True, num_layers=num_layers)

        # Dropout
        self.dropout = nn.Dropout(0.3)

        # Linear layer that predicts the BIO classes
        self.bio_out = nn.Linear(emb_size, bio_size)

    def forward(self, x):
        """
        Forward pass.
        Args:
          x (torch.Tensor): Sequences of encoded tokens (batch, seq_len).
        Returns:
          logits (torch.Tensor): Logits for BIO classes (batch, seq_len, bio_size).
        """
        # Embed the tokens once, then apply dropout to the embeddings
        x_emb = self.dropout(self.word_embedding(x))

        lstm_out, _ = self.lstm(x_emb)

        lstm_out = self.dropout(lstm_out)

        logits = self.bio_out(lstm_out)

        return logits
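
To make the tensor shapes concrete, here is a small shape check that wires the same three layers together outside the class. The sizes and token IDs are hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes chosen only for the shape check.
vocab_size, emb_size, bio_size, pad_id = 50, 16, 7, 1

embedding = nn.Embedding(vocab_size, emb_size, padding_idx=pad_id)
lstm = nn.LSTM(emb_size, emb_size, batch_first=True)
classifier = nn.Linear(emb_size, bio_size)

# Batch of 2 sequences padded to length 5 (the second ends with two pads).
x = torch.tensor([[4, 9, 7, 3, 2],
                  [5, 8, 6, pad_id, pad_id]])

x_emb = embedding(x)           # (batch, seq_len, emb_size)
lstm_out, _ = lstm(x_emb)      # (batch, seq_len, emb_size)
logits = classifier(lstm_out)  # (batch, seq_len, bio_size)

print(tuple(x_emb.shape), tuple(lstm_out.shape), tuple(logits.shape))

# With padding_idx set, the embedding of the pad token is a zero vector.
pad_is_zero = bool((x_emb[1, 3] == 0).all())
print(pad_is_zero)
```

Note that the LSTM still processes the padded positions; only the loss decides whether they contribute to training.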


### Training Function with Validation

This function trains the model on the training set and evaluates it on the validation set after each epoch. The validation loss drives both the learning-rate scheduler and an early-stopping criterion, which helps limit overfitting and indicates when hyperparameters need adjusting.



In [11]:
def train_model(model, train_dataloader, dev_dataloader, epochs, device="cpu", lr=0.005):
    """
    Train the model and evaluate on the validation set after each epoch.
    Args:
      model (nn.Module): The model to train.
      train_dataloader (DataLoader): DataLoader for the training dataset.
      dev_dataloader (DataLoader): DataLoader for the validation dataset.
      epochs (int): Number of epochs to train.
      device (str): Training device ('cpu' or 'cuda').
      lr (float): Learning rate.
    """

    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.2, patience=2
    )

    # Define the loss function; targets equal to pad_id (the padding value) are ignored
    criterion = nn.CrossEntropyLoss(ignore_index=model.pad_id)

    best_val_loss = float('inf')
    patience = 3
    patience_counter = 0

    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss = 0

        for inputs, labels in train_dataloader:
            inputs, labels = inputs.to(device), labels.to(device)

            # Forward pass
            logits = model(inputs)
            batch_size, seq_len, bio_size = logits.shape
            loss = criterion(
                logits.view(batch_size * seq_len, bio_size),
                labels.view(-1)
            )

            # Backpropagation
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            train_loss += loss.item()

        train_loss /= len(train_dataloader)

        # Validation phase
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for inputs, labels in dev_dataloader:
                inputs, labels = inputs.to(device), labels.to(device)

                # Forward pass
                logits = model(inputs)
                batch_size, seq_len, bio_size = logits.shape
                loss = criterion(
                    logits.view(batch_size * seq_len, bio_size),
                    labels.view(-1)
                )

                val_loss += loss.item()

        val_loss /= len(dev_dataloader)
        print(f"Epoch {epoch+1}/{epochs}, Training Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}")

        # Update the learning rate scheduler
        scheduler.step(val_loss)

        # Early Stopping
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0
            torch.save(model.state_dict(), "best_model.pth")  # Save the best model
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print("Early stopping: Validation loss did not improve.")
                break
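
The loss computation above flattens the logits and labels before calling the criterion. The sketch below reproduces that step on random logits with hypothetical sizes, and checks that `ignore_index` is equivalent to dropping the matching positions by hand. It keeps the notebook's convention of reusing the `<pad>` word index (1) as the ignored label value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, bio_size = 2, 4, 3
pad_id = 1  # as in the notebook, the "<pad>" word index doubles as label padding

logits = torch.randn(batch_size, seq_len, bio_size)
labels = torch.tensor([[0, 1, 1, 2],
                       [2, 2, pad_id, pad_id]])  # the last two entries are padding

criterion = nn.CrossEntropyLoss(ignore_index=pad_id)

# Flatten to (N, C) logits and (N,) targets, one row per token position.
flat_logits = logits.view(batch_size * seq_len, bio_size)
flat_labels = labels.view(-1)
loss = criterion(flat_logits, flat_labels)

# Equivalent computation: drop every position whose target equals pad_id.
keep = flat_labels != pad_id
loss_manual = nn.functional.cross_entropy(flat_logits[keep], flat_labels[keep])

print(torch.isclose(loss, loss_manual).item())
print(int(keep.sum()))  # 4
```

Only 4 of the 6 real tokens contribute here, because the two real tokens labeled with class 1 are ignored together with the padding.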

### BIO Prediction Function
This function predicts BIO labels for a given test dataset using a trained model. It processes each document and paragraph in the test data, encodes the tokens, and uses the model to generate predictions. The predicted labels are then decoded back into human-readable BIO format, enabling the reconstruction of a JSON file, which we will demonstrate in the following steps.

In [7]:
def predict_bio(model, tokenizer, test_data, device="cpu"):
    """
    Predict BIO annotations for a test dataset.
    Args:
      model (nn.Module): Trained model.
      tokenizer (NaiveTokenizer): Tokenizer used for encoding and decoding.
      test_data (list): Test dataset in JSON format.
      device (str): Device for inference ('cpu' or 'cuda').
    Returns:
      list: List of BIO predictions for each paragraph in the test dataset.
    """
    model.eval()
    model.to(device)
    predictions = []

    with torch.no_grad():
        for doc in test_data:
            doc_predictions = []
            for paragraph in doc['tokens']:
                token_ids = tokenizer.encode([token['str'] for token in paragraph])
                input_tensor = torch.tensor([token_ids]).to(device)

                logits = model(input_tensor)
                bio_preds = torch.argmax(logits, dim=2).squeeze(0).tolist()
                bio_labels = tokenizer.decode_bio_labels(bio_preds)
                doc_predictions.append(bio_labels)

            predictions.append(doc_predictions)

    return predictions
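
The decoding step can be illustrated in isolation. The sketch below uses hand-written logits and a hypothetical three-label inventory to show how `argmax` over the class dimension turns scores into BIO tags.

```python
import torch

bio_labels = ["B-Claim", "I-Claim", "O"]  # hypothetical label inventory
idx2label = dict(enumerate(bio_labels))

# Hand-written logits for one paragraph of 4 tokens: shape (1, 4, 3).
logits = torch.tensor([[[2.0, 0.1, 0.3],
                        [0.2, 1.5, 0.1],
                        [0.1, 0.9, 0.4],
                        [0.0, 0.2, 3.0]]])

# argmax over the class dimension, then drop the batch dimension of size 1.
pred_ids = torch.argmax(logits, dim=2).squeeze(0).tolist()
pred_labels = [idx2label[i] for i in pred_ids]

print(pred_ids)     # [0, 1, 1, 2]
print(pred_labels)  # ['B-Claim', 'I-Claim', 'I-Claim', 'O']
```

Each token is decoded independently, so nothing prevents an `I-` tag from appearing without a preceding `B-` tag.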

### JSON Reconstruction
This section reconstructs a JSON file that includes tokens and spans derived from the model's predictions, while preserving the original relations from the dataset.

In [9]:
import json

def create_predictions_json(predictions, test_data, output_filepath, device="cpu"):
    """
    This function reconstructs a JSON file from model predictions while preserving the original relations.
    It updates the tokens and spans using the predicted BIO labels, but keeps the relations from the test data unchanged.

    Args:
        predictions (list): List of predicted BIO labels for each document.
        test_data (list): Original test data in JSON format.
        output_filepath (str): Filepath where the reconstructed JSON will be saved.
        device (str): Device used for processing ('cpu' or 'cuda').
    """
    # Initialize the output data structure
    output_data = []
    for doc, doc_preds in zip(test_data, predictions):
        # Initialize a new document structure for predictions
        pred_doc = {"tokens": [], "spans": [], "rels": doc["rels"]}  # Keep the original relations

        for paragraph, bio_preds in zip(doc['tokens'], doc_preds):
            # Reconstruct tokens with predicted BIO labels
            reconstructed_paragraph = []
            for token, bio_label in zip(paragraph, bio_preds):
                # Copy the token with the predicted label so the original test data is not modified in place
                reconstructed_paragraph.append({**token, "arg": bio_label})

            pred_doc["tokens"].append(reconstructed_paragraph)

            # Reconstruct spans based on predicted BIO labels
            spans = reconstruct_spans(reconstructed_paragraph, bio_preds)
            if spans:  # Add spans if any were found
                pred_doc["spans"].extend(spans)

        # Append the reconstructed document to the output data
        output_data.append(pred_doc)

    # Save the reconstructed JSON file to the specified path
    with open(output_filepath, "w") as f:
        json.dump(output_data, f, indent=4)

    print(f"Complete JSON file saved at: {output_filepath}")


def reconstruct_spans(paragraph, bio_labels):
    """
    Reconstruct spans from predicted BIO labels.
    Args:
      paragraph (list): List of tokens in the paragraph.
      bio_labels (list): Predicted BIO labels for the tokens.
    Returns:
      list: A list of reconstructed spans, each represented as a dictionary.
    """
    spans = []  # Initialize an empty list to store spans
    current_span = None  # Variable to hold the current span being constructed

    for idx, label in enumerate(bio_labels):
        if label.startswith("B-"):
            # If a new span starts, close the previous one (if it exists)
            if current_span:
                spans.append(current_span)
            # Create a new span
            current_span = {
                "name": label[2:],  # Extract the type (e.g., "Claim", "Premise") by removing "B-"
                "start": paragraph[idx]["idx"],  # Start index of the span
                "end": paragraph[idx]["idx"]  # Initialize the end index as the start index
            }
        elif label.startswith("I-") and current_span and current_span["name"] == label[2:]:
            # Extend the current span if it matches the "I-" label
            current_span["end"] = paragraph[idx]["idx"]
        else:
            # Close the current span if the label doesn't match "I-*" or if it's "O"
            if current_span:
                current_span["end"] = paragraph[idx]["idx"]  # Include the first non-I-* or O as the end index
                spans.append(current_span)
                current_span = None

    # Add the last span if it wasn't closed
    if current_span:
        spans.append(current_span)

    return spans
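
For comparison, here is a compact variant of the BIO-to-span conversion on a toy label sequence. It is a sketch under a stated assumption: span ends are treated as exclusive everywhere (`end` is the index after the last token of the span), whereas the function above only moves the end to the following token when a span is closed by a non-matching label. The helper name `bio_to_spans` is hypothetical.

```python
def bio_to_spans(indices, labels):
    """Convert BIO labels to spans with an exclusive end index (assumed convention)."""
    spans = []
    current = None
    for idx, label in zip(indices, labels):
        continues = (
            label.startswith("I-")
            and current is not None
            and current["name"] == label[2:]
        )
        if continues:
            # Extend the open span to cover this token.
            current["end"] = idx + 1
            continue
        # Any other label closes the open span, if there is one.
        if current is not None:
            spans.append(current)
            current = None
        if label.startswith("B-"):
            current = {"name": label[2:], "start": idx, "end": idx + 1}
    if current is not None:
        spans.append(current)
    return spans

labels = ["B-Claim", "I-Claim", "O", "B-Premise", "I-Premise"]
spans = bio_to_spans(range(len(labels)), labels)
print(spans)
# [{'name': 'Claim', 'start': 0, 'end': 2}, {'name': 'Premise', 'start': 3, 'end': 5}]

# A sequence without any B- label yields no spans at all.
print(bio_to_spans(range(3), ["O", "I-Premise", "I-Premise"]))  # []
```

Whether the dataset expects inclusive or exclusive ends is not stated here, so the convention should be checked against the gold spans before reusing this variant.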

### Essay Dataset Pipeline
This pipeline uses the `aae_*.json` datasets, which focus on argumentative structures within essays. The model is trained, validated, and tested on this dataset. It shows how well the model captures argumentative spans in natural-language text; relations are copied unchanged from the original data.


In [12]:
import json
from torch.utils.data import DataLoader
from pathlib import Path

# Dataset Paths
data_folder = Path("data")
train_file_path = data_folder / "aae_train.json"
validation_file_path = data_folder / "aae_dev.json"
test_file_path = data_folder / "aae_test.json"

# Load the training data for the essay dataset (aae_train.json)
with open(train_file_path, "r") as f:
    train_data = json.load(f)

# Build the vocabulary and BIO labels from the training data
vocab, bio_labels = build_vocab_and_labels(train_data)
print("Vocabulary size:", len(vocab))
print("BIO Labels:", bio_labels)

# Initialize the naive tokenizer using the constructed vocabulary and BIO labels
tokenizer = NaiveTokenizer(vocab, bio_labels)

# Instantiate the dataset and DataLoader for the training set
train_dataset = ArgumentationDataset(train_data, tokenizer)
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=tokenizer.collate_fn)

# Load the validation data (aae_dev.json)
with open(validation_file_path, "r") as f:
    dev_data = json.load(f)

# Create a DataLoader for the validation dataset
dev_dataset = ArgumentationDataset(dev_data, tokenizer)
dev_dataloader = DataLoader(dev_dataset, batch_size=64, shuffle=False, collate_fn=tokenizer.collate_fn)

# Instantiate the NaiveBIOModel with appropriate parameters
model = NaiveBIOModel(emb_size=128, vocab_size=len(vocab), bio_size=len(bio_labels), pad_id=tokenizer.word2idx["<pad>"])

# Train the model using the training and validation datasets
train_model(model, train_dataloader, dev_dataloader, epochs=10, device="cpu")

# Load the test data for evaluation (aae_test.json)
with open(test_file_path, "r") as f:
    test_data = json.load(f)

# Generate BIO predictions for the test dataset
predictions = predict_bio(model, tokenizer, test_data, device="cpu")


Vocabulary size: 7528
BIO Labels: ['B-Claim', 'B-MajorClaim', 'B-Premise', 'I-Claim', 'I-MajorClaim', 'I-Premise', 'O']
Epoch 1/10, Training Loss: 1.2377, Validation Loss: 1.0624
Epoch 2/10, Training Loss: 0.9701, Validation Loss: 0.9719
Epoch 3/10, Training Loss: 0.7951, Validation Loss: 0.8495
Epoch 4/10, Training Loss: 0.6690, Validation Loss: 0.8334
Epoch 5/10, Training Loss: 0.5837, Validation Loss: 0.8419
Epoch 6/10, Training Loss: 0.5050, Validation Loss: 0.8139
Epoch 7/10, Training Loss: 0.4501, Validation Loss: 0.8273
Epoch 8/10, Training Loss: 0.4172, Validation Loss: 0.8810
Epoch 9/10, Training Loss: 0.3856, Validation Loss: 0.8634
Early stopping: Validation loss did not improve.
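
The early-stopping decision in this log can be replayed by hand. The sketch below copies the validation losses printed above and applies the same rule as `train_model` (patience of 3); the helper name `early_stopping_epoch` is hypothetical.

```python
# Validation losses copied from the training log above (epochs 1 to 9).
val_losses = [1.0624, 0.9719, 0.8495, 0.8334, 0.8419, 0.8139, 0.8273, 0.8810, 0.8634]

def early_stopping_epoch(losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-based; stop_epoch is None if never triggered."""
    best_loss, best_epoch, counter = float("inf"), None, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_epoch, counter = loss, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return epoch, best_epoch
    return None, best_epoch

stop_epoch, best_epoch = early_stopping_epoch(val_losses)
print(stop_epoch, best_epoch)  # 9 6
```

The best checkpoint therefore corresponds to epoch 6. In the cell as shown, `predict_bio` is called on the in-memory model after epoch 9; using the epoch-6 weights would require loading `best_model.pth` first.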


In [13]:
# Save the complete predictions to a JSON file
create_predictions_json(predictions, test_data, "aae_predictions_naive.json", device="cpu")

!python evaluate.py aae_predictions_naive.json aae_test.json

Complete JSON file saved at: aae_predictions_naive.json


********************** SPANS *************************** 
   STRICT EVALUATION
    > Argument mining spans (unlabeled)
      Precision : 0.5684005273225203 
      Recall    : 0.45385944323278926
      F-score   : 0.4992141273034837
    > Argument mining spans (labeled)
      Precision : 0.5021464597526635 
      Recall    : 0.39722480285125195 
      F-score   : 0.43875376756198553

    RELAXED EVALUATION (α = 0.5)
    > Argument mining spans (unlabeled)
      Precision : 0.6387572251845088 
      Recall    : 0.5927761954090096
      F-score   : 0.6047907513686014
    > Argument mining spans (labeled)
      Precision : 0.5697982327755062 
      Recall    : 0.5143077550395093 
      F-score   : 0.5319557193539048



******************* RELATIONS *************************** 
   STRICT EVALUATION
    > Argument mining spans (unlabeled)
      Precision : 1.0 
      Recall    : 1.0
      F-score   : 1.0
    > Argument mining spans (l


### Medical Dataset Pipeline
This pipeline works with the `abstrct_*.json` datasets, which focus on argumentative structures within medical abstracts. As with the essay dataset, the model is trained, validated, and tested on this dataset. The medical dataset poses its own challenges because of its domain-specific language and structure, which makes it a useful test of the model's generalizability.







In [None]:
import json
from torch.utils.data import DataLoader
from pathlib import Path

# Dataset Paths
data_folder = Path("data")
train_file_path = data_folder / "train_data.json"
validation_file_path = data_folder / "validation_data.json"
test_file_path = data_folder / "test_data.json"

# Load the training data for the medical dataset (abstrct_neoplasm_train.json)
with open(train_file_path, "r") as f:
    train_data = json.load(f)

# Build the vocabulary and BIO labels from the training data
vocab, bio_labels = build_vocab_and_labels(train_data)
print("Vocabulary size:", len(vocab))
print("BIO Labels:", bio_labels)

# Initialize the naive tokenizer using the constructed vocabulary and BIO labels
tokenizer = NaiveTokenizer(vocab, bio_labels)

# Instantiate the dataset and DataLoader for the training set
train_dataset = ArgumentationDataset(train_data, tokenizer)
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=tokenizer.collate_fn)

# Load the validation data (abstrct_neoplasm_dev.json)
with open(validation_file_path, "r") as f:
    dev_data = json.load(f)

# Create a DataLoader for the validation dataset
dev_dataset = ArgumentationDataset(dev_data, tokenizer)
dev_dataloader = DataLoader(dev_dataset, batch_size=64, shuffle=False, collate_fn=tokenizer.collate_fn)

# Instantiate the NaiveBIOModel with appropriate parameters
model = NaiveBIOModel(emb_size=128, vocab_size=len(vocab), bio_size=len(bio_labels), pad_id=tokenizer.word2idx["<pad>"])

# Train the model using the training and validation datasets
train_model(model, train_dataloader, dev_dataloader, epochs=8, device="cpu")

# Load the test data for evaluation (abstrct_neoplasm_test.json)
with open(test_file_path, "r") as f:
    test_data = json.load(f)

# Generate BIO predictions for the test dataset
predictions = predict_bio(model, tokenizer, test_data, device="cpu")

Vocabulary size: 1272
BIO Labels: ['B-Claim', 'B-MajorClaim', 'B-Premise', 'I-Claim', 'I-MajorClaim', 'I-Premise', 'O']
Epoch 1/8 : Training Loss = 1.9865
Epoch 1/8 : Validation Loss = 1.8065
Epoch 2/8 : Training Loss = 1.8275
Epoch 2/8 : Validation Loss = 1.6035
Epoch 3/8 : Training Loss = 1.6469
Epoch 3/8 : Validation Loss = 1.3057
Epoch 4/8 : Training Loss = 1.3738
Epoch 4/8 : Validation Loss = 1.1033
Epoch 5/8 : Training Loss = 1.0816
Epoch 5/8 : Validation Loss = 1.1936
Epoch 6/8 : Training Loss = 1.1272
Epoch 6/8 : Validation Loss = 1.1625
Epoch 7/8 : Training Loss = 1.1025
Epoch 7/8 : Validation Loss = 1.0626
Epoch 8/8 : Training Loss = 0.9630
Epoch 8/8 : Validation Loss = 1.0427


We encountered a "division by zero" error because the model fails to predict any B labels. The predictions consist only of O and I labels (mainly I-Premise), so no spans can be generated: spans are constructed starting from a B label, and without one the spans list stays empty. Consequently, when we evaluate the predictions with `evaluate.py`, the empty spans list leads to a division by zero during the computation of the evaluation metrics.
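
The internals of `evaluate.py` are not shown here, so the following is only a sketch of how precision, recall, and F-score are commonly guarded against empty predictions. The helper name and its count-based arguments are hypothetical.

```python
def precision_recall_f1(n_correct, n_predicted, n_gold):
    """Hypothetical helper: returns 0.0 for a metric instead of dividing by zero."""
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# No predicted spans at all: every metric falls back to 0.0.
print(precision_recall_f1(0, 0, 25))  # (0.0, 0.0, 0.0)

# Ordinary case with made-up counts.
p, r, f = precision_recall_f1(10, 20, 25)
print(f"{p:.3f} {r:.3f} {f:.3f}")  # 0.500 0.400 0.444
```

Reporting zeros in this case keeps the evaluation running and makes the failure mode visible in the scores rather than as a crash.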

In [None]:
# Save the complete predictions to a JSON file
create_predictions_json(predictions, test_data, "abstrct_predictions.json", device="cpu")

!python evaluate.py abstrct_predictions.json abstrct_neoplasm_test.json

Complete JSON file saved at: abstrct_predictions.json
[error] division by zero


In [None]:
!python view_data.py abstrct_predictions.json

0	Imatinib	O
1	(	O
2	Gleevec	O
3	)	O
4	,	O
5	a	O
6	highly	O
7	effective	O
8	specific	O
9	tyrosine	O
10	kinase	O
11	inhibitor	O
12	,	O
13	demonstrates	O
14	a	O
15	better	O
16	side	O
17	effect	O
18	profile	O
19	than	O
20	interferon-alpha	O
21	(	O
22	IFN	O
23	)	O
24	,	O
25	which	O
26	impairs	O
27	patients	O
28	'	O
29	quality	O
30	of	O
31	life	O
32	(	O
33	QoL	O
34	)	O
35	.	O
36	This	O
37	phase	O
38	III	O
39	international	O
40	study	O
41	evaluated	O
42	QoL	O
43	outcomes	O
44	in	O
45	1,106	O
46	newly	O
47	diagnosed	O
48	patients	O
49	with	O
50	chronic-phase	O
51	chronic	O
52	myeloid	O
53	leukemia	O
54	(	O
55	CML	O
56	)	O
57	who	O
58	were	O
59	randomized	O
60	to	O
61	receive	O
62	either	O
63	imatinib	O
64	400	O
65	mg	O
66	daily	O
67	or	O
68	IFN	O
69	up	O
70	to	O
71	5	O
72	MU	O
73	/	O
74	m	O
75	(	O
76	2	O
77	)	O
78	/	O
79	d	O
80	with	O
81	cytarabine	O
82	(	O
83	Ara-C	O
84	)	O
85	20	O
86	mg	O
87	/	O
88	m	O
89	(	O
90	2	O
91	)	O
92	/	O
93	d	O
94	added	O
95	for	O
96	10	O
97	days	O
98	every	O
99	mo