<a href="https://colab.research.google.com/github/Kiwihead15/Car_Detector/blob/main/patient_condition_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The purpose of this notebook is to predict a patient's condition from their drug review.

To do so, we fine-tune a model on a dataset of about 215,000 (condition, review) pairs: 161,297 for training and 53,766 for testing.

In [1]:
!pip install datasets evaluate transformers[sentencepiece]==4.28.0



# Import dataset

In [3]:
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!unzip drugsCom_raw.zip

--2023-08-24 12:29:59--  https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘drugsCom_raw.zip’

drugsCom_raw.zip        [     <=>            ]  41.00M  41.1MB/s    in 1.0s    

2023-08-24 12:30:01 (41.1 MB/s) - ‘drugsCom_raw.zip’ saved [42989872]

Archive:  drugsCom_raw.zip
  inflating: drugsComTest_raw.tsv    
  inflating: drugsComTrain_raw.tsv   


# Using the "datasets" library

In [4]:
# loading the dataset into memory
from datasets import load_dataset

data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]
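To make the role of `delimiter="\t"` concrete, here is a minimal sketch that parses a tab-separated sample with the standard `csv` module instead of `datasets`. The two sample rows are hypothetical, and the empty first header field is an assumption about the raw files (it would explain the `Unnamed: 0` column that shows up after loading).

```python
import csv
import io

# Hypothetical two-row sample in a layout similar to the raw files:
# tab-separated, an unnamed first (index) column, and quoted reviews.
sample = (
    '\tdrugName\tcondition\treview\n'
    '1\tDrugA\tADHD\t"First line.\r\nSecond line."\n'
    '2\tDrugB\tAcne\t"Works well."\n'
)

# newline="" leaves line endings untouched so quoted fields survive intact.
reader = csv.DictReader(io.StringIO(sample, newline=""), delimiter="\t")
rows = list(reader)

print(reader.fieldnames)         # the unnamed column has an empty string as its name
print(len(rows))                 # a quoted review spanning two lines is still one row
print(repr(rows[0]["review"]))
```

A quoted field may contain line breaks, which is why the number of rows can be smaller than the number of lines in the file.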

In [5]:
# checking dataset structure
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

In [6]:
# checking the dataset features in the train split
drug_dataset['train'].features

{'Unnamed: 0': Value(dtype='int64', id=None),
 'drugName': Value(dtype='string', id=None),
 'condition': Value(dtype='string', id=None),
 'review': Value(dtype='string', id=None),
 'rating': Value(dtype='float64', id=None),
 'date': Value(dtype='string', id=None),
 'usefulCount': Value(dtype='int64', id=None)}

In [7]:
# checking the first three samples in the train split
drug_dataset['train'][:3]

{'Unnamed: 0': [206461, 95260, 92703],
 'drugName': ['Valsartan', 'Guanfacine', 'Lybrel'],
 'condition': ['Left Ventricular Dysfunction', 'ADHD', 'Birth Control'],
 'review': ['"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
  '"My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. \r\nWe have tried many different medications and so far this is the most effect

In [8]:
# Let's create a validation dataset
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)
# Rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")
# Add the "test" set to our `DatasetDict`
drug_dataset_clean["test"] = drug_dataset["test"]

drug_dataset_clean

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 129037
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 32260
    })
    test: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})
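The split sizes above can be checked by hand. This is a small sketch that assumes the train size is the fraction rounded down and that the remainder becomes the validation split. That assumption reproduces the reported numbers, but how the library rounds internally is not confirmed here.

```python
import math

n_total = 161297       # rows in the original train split (from the output above)
train_fraction = 0.8   # value passed as train_size

# Assumption: round the train size down, give the rest to validation.
n_train = math.floor(train_fraction * n_total)
n_validation = n_total - n_train

print(n_train, n_validation)
```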

In [9]:
# remove the columns we do not need for classification
drug_dataset_clean = drug_dataset_clean.remove_columns(['Unnamed: 0', 'drugName', 'rating', 'date', 'usefulCount'])

In [10]:
# check the remaining columns and their features
drug_dataset_clean['train'].features

{'condition': Value(dtype='string', id=None),
 'review': Value(dtype='string', id=None)}

In [11]:
# check unique condition values

set(drug_dataset_clean['train']['condition'])

{'0</span> users found this comment helpful.',
 '10</span> users found this comment helpful.',
 '110</span> users found this comment helpful.',
 '11</span> users found this comment helpful.',
 '121</span> users found this comment helpful.',
 '123</span> users found this comment helpful.',
 '12</span> users found this comment helpful.',
 '13</span> users found this comment helpful.',
 '142</span> users found this comment helpful.',
 '146</span> users found this comment helpful.',
 '14</span> users found this comment helpful.',
 '15</span> users found this comment helpful.',
 '16</span> users found this comment helpful.',
 '17</span> users found this comment helpful.',
 '18</span> users found this comment helpful.',
 '19</span> users found this comment helpful.',
 '1</span> users found this comment helpful.',
 '20</span> users found this comment helpful.',
 '21</span> users found this comment helpful.',
 '22</span> users found this comment helpful.',
 '23</span> users found this comment 

In [12]:
import re

In [13]:
# remove rows whose condition is leaked HTML ("... users found this comment helpful.") instead of a real condition
drug_dataset_clean = drug_dataset_clean.filter(lambda x: re.search(r'users found this comment helpful\.$', str(x["condition"])) is None)

Filter:   0%|          | 0/129037 [00:00<?, ? examples/s]

Filter:   0%|          | 0/32260 [00:00<?, ? examples/s]

Filter:   0%|          | 0/53766 [00:00<?, ? examples/s]
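The filter can be tried on plain strings to see what it keeps. This is a minimal sketch using only the `re` module. The helper name `is_valid_condition` is hypothetical, and the final dot is escaped here so that it matches a literal period.

```python
import re

# Pattern for the leaked "helpful" counter text; the dot is matched literally.
LEAKED_HTML = re.compile(r"users found this comment helpful\.$")

def is_valid_condition(value):
    """Return True unless the value is the leaked 'helpful' counter text."""
    return LEAKED_HTML.search(str(value)) is None

samples = ["ADHD", "0</span> users found this comment helpful.", None]
kept = [s for s in samples if is_valid_condition(s)]
print(kept)
```

Note that `None` passes this filter, because `str(None)` is the string `"None"`. Missing conditions therefore need the separate filter in the following cell.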

In [14]:
# check the remaining condition values
set(drug_dataset_clean['train']['condition'])

{'ADHD',
 'AIDS Related Wasting',
 'AV Heart Block',
 'Abdominal Distension',
 'Abnormal Uterine Bleeding',
 'Abortion',
 'Acetaminophen Overdose',
 'Acne',
 'Actinic Keratosis',
 'Acute Coronary Syndrome',
 'Acute Lymphoblastic Leukemia',
 'Acute Nonlymphocytic Leukemia',
 'Acute Promyelocytic Leukemia',
 "Addison's Disease",
 'Adrenocortical Insufficiency',
 'Adult Human Growth Hormone Deficiency',
 'Aggressive Behavi',
 'Agitated State',
 'Agitation',
 'Alcohol Dependence',
 'Alcohol Withdrawal',
 'Allergic Reactions',
 'Allergic Rhinitis',
 'Allergic Urticaria',
 'Allergies',
 'Alopecia',
 'Alpha-1 Proteinase Inhibitor Deficiency',
 "Alzheimer's Disease",
 'Amebiasis',
 'Amenorrhea',
 'Amyotrophic Lateral Sclerosis',
 'Anal Fissure and Fistula',
 'Anal Itching',
 'Anaphylaxis',
 'Anaplastic Astrocytoma',
 'Anaplastic Oligodendroglioma',
 'Androgenetic Alopecia',
 'Anemia',
 'Anemia Associated with Chronic Renal Failure',
 'Anemia, Chemotherapy Induced',
 'Anemia, Sickle Cell',
 'An

In [15]:
# drop records whose condition is None before lowercasing them (None has no .lower() method)

drug_dataset_clean = drug_dataset_clean.filter(lambda x: x["condition"] is not None)

Filter:   0%|          | 0/128335 [00:00<?, ? examples/s]

Filter:   0%|          | 0/32062 [00:00<?, ? examples/s]

Filter:   0%|          | 0/53495 [00:00<?, ? examples/s]

In [16]:
# Build a function to lower-case the condition values
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}

drug_dataset_clean = drug_dataset_clean.map(lowercase_condition)

Map:   0%|          | 0/127641 [00:00<?, ? examples/s]

Map:   0%|          | 0/31857 [00:00<?, ? examples/s]

Map:   0%|          | 0/53200 [00:00<?, ? examples/s]

In [17]:
# Check that lowercasing worked
drug_dataset_clean["train"]["condition"][:3]

['asthma, maintenance', 'pain', 'migraine']
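The totals shown in the progress bars make it possible to see how many rows each cleaning step removed. The sketch below only does that bookkeeping. The three numbers per split are read from the `Filter` and `Map` bars above, on the assumption that each bar shows the number of rows entering that step.

```python
# Rows per split: before cleaning, after the "helpful" filter,
# and after dropping missing conditions (read from the progress bars).
counts = {
    "train": (129037, 128335, 127641),
    "validation": (32260, 32062, 31857),
    "test": (53766, 53495, 53200),
}

removed = {}
for split, (raw, no_html, no_none) in counts.items():
    # (rows removed by the regex filter, rows removed by the None filter)
    removed[split] = (raw - no_html, no_html - no_none)
    print(split, removed[split], f"{no_none / raw:.1%} kept")
```

In every split, roughly 1% of the rows are dropped, so the cleaning does not change the size of the problem in any meaningful way.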

In [18]:
# use "class_encode_column()" to convert the condition column to a ClassLabel feature: it finds all the unique string values in the column and maps each one to an integer id.
drug_dataset_clean = drug_dataset_clean.class_encode_column("condition")

Casting to class labels:   0%|          | 0/127641 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/31857 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/53200 [00:00<?, ? examples/s]

In [19]:
drug_dataset_clean['train'].features

{'condition': ClassLabel(names=['abdominal distension', 'abnormal uterine bleeding', 'abortion', 'acetaminophen overdose', 'acial lipoatrophy', 'acial wrinkles', 'acne', 'actinic keratosis', 'actor ix deficiency', 'acute coronary syndrome', 'acute lymphoblastic leukemia', 'acute nonlymphocytic leukemia', 'acute promyelocytic leukemia', "addison's disease", 'adhd', 'adrenocortical insufficiency', 'adult human growth hormone deficiency', 'aggressive behavi', 'agitated state', 'agitation', 'aids related wasting', 'ailure to thrive', 'alcohol dependence', 'alcohol withdrawal', 'allergic reactions', 'allergic rhinitis', 'allergic urticaria', 'allergies', 'alopecia', 'alpha-1 proteinase inhibitor deficiency', "alzheimer's disease", 'amebiasis', 'amenorrhea', 'amilial cold autoinflammatory syndrome', 'amilial mediterranean feve', 'amyotrophic lateral sclerosis', 'anal fissure and fistula', 'anal itching', 'anaphylaxis', 'anaplastic astrocytoma', 'anaplastic oligodendroglioma', 'androgenetic a
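Conceptually, the encoding assigns an integer to each distinct string in sorted order, as the alphabetical `names` list above suggests. The sketch below imitates that in plain Python with hypothetical miniature splits. It also illustrates a possible pitfall: as far as I can tell, calling `class_encode_column` on a `DatasetDict` encodes each split on its own, so it may be worth comparing `features["condition"].names` across the three splits before training.

```python
def build_label_map(values):
    """Map each distinct string to an integer id, in sorted order."""
    names = sorted(set(values))
    return {name: idx for idx, name in enumerate(names)}

# Hypothetical miniature splits; "acne" is missing from the validation split.
train_conditions = ["pain", "acne", "adhd", "pain"]
validation_conditions = ["pain", "adhd"]

# Encoding each split separately can give the same label different ids.
train_map = build_label_map(train_conditions)
validation_map = build_label_map(validation_conditions)
print(train_map["pain"], validation_map["pain"])

# Building one mapping from all splits keeps the ids consistent.
shared_map = build_label_map(train_conditions + validation_conditions)
encoded_validation = [shared_map[c] for c in validation_conditions]
print(encoded_validation)
```

If the splits do turn out to have different name lists, one option is to build the label list once from all splits and cast every split to that shared feature.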

In [20]:
len(set(drug_dataset_clean['train']['condition']))

785

In [21]:
from transformers import AutoTokenizer, DataCollatorWithPadding

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example['review'], truncation=True)


tokenized_datasets = drug_dataset_clean.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map:   0%|          | 0/127641 [00:00<?, ? examples/s]

Map:   0%|          | 0/31857 [00:00<?, ? examples/s]

Map:   0%|          | 0/53200 [00:00<?, ? examples/s]
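The tokenizer truncates long reviews but does not pad them, so padding is left to the collator, which pads each batch to its longest sequence. Here is a minimal pure-Python sketch of that idea. The function `pad_batch` and the token ids are hypothetical stand-ins, not the collator's actual implementation.

```python
def pad_batch(batch, pad_token_id=0):
    """Pad every sequence to the longest one in the batch.

    Returns padded input ids plus an attention mask
    (1 = real token, 0 = padding).
    """
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_token_id] * (max_len - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for two reviews of different lengths.
batch = [[101, 2009, 2573, 102], [101, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch rather than to a global maximum keeps short reviews cheap to process, which matters here because review lengths vary widely.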

In [22]:
drug_dataset_clean

DatasetDict({
    train: Dataset({
        features: ['condition', 'review'],
        num_rows: 127641
    })
    validation: Dataset({
        features: ['condition', 'review'],
        num_rows: 31857
    })
    test: Dataset({
        features: ['condition', 'review'],
        num_rows: 53200
    })
})

In [30]:
drug_dataset_clean.save_to_disk("drug-reviews")

Saving the dataset (0/1 shards):   0%|          | 0/127641 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/31857 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/53200 [00:00<?, ? examples/s]

In [36]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

In [37]:
from transformers import AutoModelForSequenceClassification

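# take the number of classes from the encoded "condition" column instead of hard-coding 785
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=drug_dataset_clean["train"].features["condition"].num_classes)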

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [38]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [39]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['condition', 'review', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 127641
    })
    validation: Dataset({
        features: ['condition', 'review', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 31857
    })
    test: Dataset({
        features: ['condition', 'review', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 53200
    })
})

In [None]:
# Remove the columns the model does not expect (here, the raw review text).
tokenized_datasets = tokenized_datasets.remove_columns("review")
# Rename the condition column to labels (because the model expects the argument to be named labels).
tokenized_datasets = tokenized_datasets.rename_column("condition", "labels")
# Set the format of the datasets so they return PyTorch tensors instead of lists.
tokenized_datasets.set_format("torch")


In [42]:
# We can then check that the result only has columns that our model will accept
tokenized_datasets["train"].column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

In [35]:
trainer.train()



ValueError: ignored
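The traceback is cut off, so the cause cannot be confirmed from this output. One plausible explanation is that the `Trainer` in `In [38]` was built from `tokenized_datasets` while the label column was still named `condition`, and the rename to `labels` only happened in a later cell. The sketch below imitates, under stated assumptions, how columns that do not match the model's `forward` arguments get dropped. The `forward` stand-in and the helper `columns_passed_to_model` are hypothetical simplifications.

```python
import inspect

def forward(input_ids=None, attention_mask=None, token_type_ids=None, labels=None):
    """Stand-in for the model's forward method (hypothetical, simplified)."""

def columns_passed_to_model(column_names):
    # Assumption: only columns named after forward() arguments are kept,
    # plus the conventional "label" / "label_ids" names.
    accepted = set(inspect.signature(forward).parameters) | {"label", "label_ids"}
    return [name for name in column_names if name in accepted]

before_rename = ["condition", "review", "input_ids", "token_type_ids", "attention_mask"]
after_rename = ["labels", "input_ids", "token_type_ids", "attention_mask"]

print(columns_passed_to_model(before_rename))  # no label column survives
print(columns_passed_to_model(after_rename))   # "labels" is kept
```

If this is indeed the cause, running the rename cell before constructing the `Trainer` and then creating the `Trainer` from the renamed datasets should give the model a `labels` column from which to compute a loss.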