# Time to slice and dice

### Slicing and Dicing

In [1]:
# Downloading remote dataset
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!unzip drugsCom_raw.zip

--2024-02-06 13:22:29--  https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘drugsCom_raw.zip’

drugsCom_raw.zip        [     <=>            ]  41.00M  39.8MB/s    in 1.0s    

2024-02-06 13:22:30 (39.8 MB/s) - ‘drugsCom_raw.zip’ saved [42989872]

Archive:  drugsCom_raw.zip
  inflating: drugsComTest_raw.tsv    
  inflating: drugsComTrain_raw.tsv   


In [2]:
! ls

drugsCom_raw.zip  drugsComTest_raw.tsv	drugsComTrain_raw.tsv  sample_data


In [3]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected 

In [4]:
from datasets import load_dataset
dataset_files = {'train' : 'drugsComTrain_raw.tsv', 'test' : 'drugsComTest_raw.tsv'}
drug_dataset = load_dataset('csv', data_files = dataset_files, delimiter = '\t')

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [5]:
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

Let's select some random samples from the dataset to get a feel for the data.

In [6]:
sample_dataset = drug_dataset['train'].shuffle(seed = 42).select(range(1000))

In [7]:
sample_dataset

Dataset({
    features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
    num_rows: 1000
})

In [8]:
sample_dataset[:3]

{'Unnamed: 0': [87571, 178045, 80482],
 'drugName': ['Naproxen', 'Duloxetine', 'Mobic'],
 'condition': ['Gout, Acute', 'ibromyalgia', 'Inflammatory Conditions'],
 'review': ['"like the previous person mention, I&#039;m a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
  '"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."',
  '"I have been taking Mobic for over a year with no side effects other than 

Observations:
- 'Unnamed: 0' looks like a unique patient ID
- 'condition' contains a mix of uppercase and lowercase characters
- 'review' contains line separators such as \r and \n, and HTML character codes such as &#039;

Let's verify our first hypothesis, that 'Unnamed: 0' is a unique patient ID.

In [9]:
for split in drug_dataset.keys():
  assert len(drug_dataset[split]) == len(drug_dataset[split].unique('Unnamed: 0'))
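
The assertion compares the number of rows with the number of distinct values in the column. A minimal plain-Python sketch of the same idea is shown here; `all_unique` is a hypothetical helper name, and the IDs are the three visible in the sample output above. Note that uniqueness only supports the hypothesis: it does not prove the values are patient IDs.

```python
def all_unique(values):
    """Return True when no value appears more than once."""
    return len(values) == len(set(values))


ids = [87571, 178045, 80482]
print(all_unique(ids))            # True
print(all_unique(ids + [87571]))  # False
```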

The assertion passes, so every value is unique within each split, which is consistent with our hypothesis :) Let's rename this column to something more meaningful; patient_id is a good name for it.

In [10]:
drug_dataset = drug_dataset.rename_column('Unnamed: 0', 'patient_id')

In [11]:
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

Let's find the number of unique conditions in each split. The same call works for 'drugName'.

In [12]:
unique_conditions = drug_dataset.unique('condition')

In [13]:
print(len(unique_conditions['train']))

885


In [14]:
print(len(unique_conditions['test']))

709


Let's convert the conditions to lowercase.

In [15]:
def condition_to_lowercase(example):
  return {'condition': example['condition'].lower()}

In [16]:
drug_dataset.map(condition_to_lowercase)

Map:   0%|          | 0/161297 [00:00<?, ? examples/s]

AttributeError: 'NoneType' object has no attribute 'lower'

Oops! There are some None values. Let's remove the rows whose condition is None.

In [17]:
drug_dataset = drug_dataset.filter(lambda x: x['condition'] is not None)

Filter:   0%|          | 0/161297 [00:00<?, ? examples/s]

Filter:   0%|          | 0/53766 [00:00<?, ? examples/s]

In [18]:
drug_dataset = drug_dataset.map(condition_to_lowercase)

Map:   0%|          | 0/160398 [00:00<?, ? examples/s]

Map:   0%|          | 0/53471 [00:00<?, ? examples/s]

In [19]:
drug_dataset['train'][:3]['condition']

['left ventricular dysfunction', 'adhd', 'birth control']
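
Filtering is the approach taken here. If the rows with a missing condition had to be kept instead, one alternative would be a mapping function that tolerates None. This is only a sketch of that alternative, not what the notebook does, and `condition_to_lowercase_safe` is a hypothetical name.

```python
def condition_to_lowercase_safe(example):
    condition = example["condition"]
    # Leave missing values untouched instead of calling .lower() on None.
    return {"condition": condition.lower() if condition is not None else None}


print(condition_to_lowercase_safe({"condition": "Birth Control"}))
print(condition_to_lowercase_safe({"condition": None}))
```

A function of this shape could be passed to `map()` in place of `condition_to_lowercase`, at the cost of carrying None values into later steps.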

### Add New Column

A review can be a single word or more than a thousand words, so we need to handle review length carefully. Let's add a column containing the number of whitespace-separated words in each review.

In [20]:
def add_review_len(example):
  return {'review_len' : len(example['review'].split())}

In [21]:
drug_dataset = drug_dataset.map(add_review_len)
drug_dataset

Map:   0%|          | 0/160398 [00:00<?, ? examples/s]

Map:   0%|          | 0/53471 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_len'],
        num_rows: 160398
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_len'],
        num_rows: 53471
    })
})

Let's sort the data by review length to see the extremes.

In [22]:
sorted_by_review_len = drug_dataset.sort('review_len')

In [23]:
sorted_by_review_len['train'][:3]['review_len']

[1, 1, 1]

In [24]:
sorted_by_review_len['train'][-3:]['review_len']

[1107, 1162, 1894]

Let's remove examples whose reviews have 30 words or fewer.

In [25]:
drug_dataset = drug_dataset.filter(lambda x:x['review_len']>30)
drug_dataset

Filter:   0%|          | 0/160398 [00:00<?, ? examples/s]

Filter:   0%|          | 0/53471 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_len'],
        num_rows: 138514
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_len'],
        num_rows: 46108
    })
})

In [26]:
import numpy as np
np.max(drug_dataset['train']['review_len'])

1894

In [27]:
np.min(drug_dataset['train']['review_len'])

31
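
The minimum of 31 follows from the strict `> 30` comparison in the filter. The small sketch below, on made-up strings, shows how the whitespace-based word count and the threshold interact; `review_len` is a hypothetical stand-in for the body of `add_review_len`.

```python
def review_len(text):
    # str.split() with no arguments splits on any run of whitespace,
    # so "\r\n" line breaks and repeated spaces do not inflate the count.
    return len(text.split())


reviews = [
    "Great",
    "It works\r\nvery   well",
    " ".join(["word"] * 30),
    " ".join(["word"] * 31),
]
lengths = [review_len(r) for r in reviews]
kept = [n for n in lengths if n > 30]
print(lengths)  # [1, 4, 30, 31]
print(kept)     # [31]
```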

Let's unescape the HTML character codes present in the reviews.

In [28]:
import html

text = "I&#039;m a transformer called BERT"
html.unescape(text)

"I'm a transformer called BERT"

In [29]:
drug_dataset = drug_dataset.map(lambda example: {'review': html.unescape(example['review'])})

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

In [30]:
drug_dataset['train'][:3]['review']

['"My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. \r\nWe have tried many different medications and so far this is the most effective."',
 '"I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ing
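
The reviews still contain the `\r\n` line separators noted in the observations. The notebook leaves them in place; if they needed to be removed as well, a cleaning function along the following lines could be used. This is a sketch under that assumption, and `clean_review` is a hypothetical name.

```python
import html


def clean_review(text):
    # Decode HTML character references such as &#039; into plain characters.
    text = html.unescape(text)
    # Collapse "\r\n" line breaks and repeated spaces into single spaces.
    return " ".join(text.split())


raw = "I&#039;m being weaned off \r\nit now. Went from 60 mg to 30mg"
print(clean_review(raw))
```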

### Map function's Superpower

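With batched=True, the mapped function receives a batch of examples at once, which makes mapping much faster. In the timings below, the batched run takes about 1 min 14 s of wall time versus 2 min 20 s unbatched.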

In [31]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["review"], truncation=True)



tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

In [32]:
%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

CPU times: user 1min 51s, sys: 821 ms, total: 1min 52s
Wall time: 1min 14s


In [33]:
%time tokenized_dataset_unbatched = drug_dataset.map(tokenize_function, batched=False)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

CPU times: user 2min 16s, sys: 1.26 s, total: 2min 17s
Wall time: 2min 20s
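
With `batched=True`, the function passed to `map()` receives a dictionary whose values are lists (one entry per example in the batch) and must return lists of the same length. The sketch below shows that shape on a plain dictionary, using the review-length computation from earlier; `add_review_len_batched` is a hypothetical name and the two reviews are made up.

```python
def add_review_len_batched(batch):
    # With batched=True, batch["review"] is a list of strings, not one string.
    return {"review_len": [len(review.split()) for review in batch["review"]]}


batch = {"review": ["works well for me", "no side effects"]}
print(add_review_len_batched(batch))  # {'review_len': [4, 3]}
```

The tokenizer call in `tokenize_function` already accepts a list of strings, which is why the same function works in both the batched and unbatched runs.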


### From Datasets to DataFrames and back

In [34]:
drug_dataset.set_format('pandas')

In [36]:
drug_dataset['train'][:3]

,patient_id,drugName,condition,review,rating,date,usefulCount,review_len
0,95260,Guanfacine,adhd,"""My son is halfway through his fourth week of ...",8.0,"April 27, 2010",192,141
1,92703,Lybrel,birth control,"""I used to take another oral contraceptive, wh...",5.0,"December 14, 2009",17,134
2,138000,Ortho Evra,birth control,"""This is my first time using any form of birth...",8.0,"November 3, 2015",10,89


In [37]:
train_df = drug_dataset['train'][:]

Under the hood, ```Dataset.set_format()``` changes the return format for the dataset’s ```__getitem__()``` dunder method. This means that when we want to create a new object like train_df from a Dataset in the "pandas" format, we need to slice the whole dataset to obtain a pandas.DataFrame. You can verify for yourself that the type of ```drug_dataset["train"]``` is Dataset, irrespective of the output format.

In [38]:
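# value_counts() returns a Series indexed by condition. Naming the index and
# passing name= to reset_index() yields the same two columns on pandas 1.x and 2.x.
frequencies = (
    train_df["condition"]
    .value_counts()
    .rename_axis("condition")
    .reset_index(name="frequency")
)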
frequencies.head()

,condition,frequency
0,birth control,27655
1,depression,8023
2,acne,5209
3,anxiety,4991
4,pain,4744
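
For intuition, the frequency table is just a count of how often each condition occurs, sorted from most to least common. A plain-Python sketch of the same computation on a made-up list of conditions:

```python
from collections import Counter

conditions = [
    "birth control", "acne", "birth control",
    "depression", "acne", "birth control",
]
# most_common() returns (value, count) pairs sorted by descending count.
frequencies = Counter(conditions).most_common()
print(frequencies)  # [('birth control', 3), ('acne', 2), ('depression', 1)]
```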


We can convert the DataFrame back into a Dataset:

In [39]:
from datasets import Dataset
frq_dataset = Dataset.from_pandas(frequencies)
frq_dataset

Dataset({
    features: ['condition', 'frequency'],
    num_rows: 819
})

In [40]:
frq_dataset[:3]

{'condition': ['birth control', 'depression', 'acne'],
 'frequency': [27655, 8023, 5209]}

 Computing the average rating per drug

In [48]:
average_ratings=(
    train_df[['drugName','rating']]
    .groupby('drugName')
    .mean()
    .reset_index()
    .rename(columns={"rating": "avg_rating"})
)
average_ratings.head()

,drugName,avg_rating
0,A + D Cracked Skin Relief,10.0
1,A / B Otic,10.0
2,Abacavir / dolutegravir / lamivudine,7.953488
3,Abacavir / lamivudine / zidovudine,9.0
4,Abatacept,7.3125


In [49]:
avg_rating_dataset = Dataset.from_pandas(average_ratings)
avg_rating_dataset

Dataset({
    features: ['drugName', 'avg_rating'],
    num_rows: 3052
})

In [50]:
avg_rating_dataset[:3]

{'drugName': ['A + D Cracked Skin Relief',
  'A / B Otic',
  'Abacavir / dolutegravir / lamivudine'],
 'avg_rating': [10.0, 10.0, 7.953488372093023]}

In [51]:
drug_dataset.reset_format()

### Creating validation dataset

In [53]:
clean_drug_dataset = drug_dataset['train'].train_test_split(train_size = 0.8, seed=42)
clean_drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_len'],
        num_rows: 110811
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_len'],
        num_rows: 27703
    })
})
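
The split sizes can be checked by hand. The sketch below assumes the training share is rounded down and the remainder goes to the held-out split; that rounding rule is an assumption, though it matches the counts displayed above.

```python
import math

n_rows = 138514   # rows in the filtered training split
train_size = 0.8

# Assumed rule: round the training share down, give the rest to validation.
n_train = math.floor(n_rows * train_size)
n_validation = n_rows - n_train
print(n_train, n_validation)  # 110811 27703
```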

In [54]:
clean_drug_dataset['validation'] = clean_drug_dataset.pop('test')
clean_drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_len'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_len'],
        num_rows: 27703
    })
})

In [55]:
clean_drug_dataset['test'] = drug_dataset['test']

In [56]:
clean_drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_len'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_len'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_len'],
        num_rows: 46108
    })
})

### Saving Dataset


<table>
<tr><th>Data format</th><th>Function</th></tr>
<tr><td>Arrow</td><td>Dataset.save_to_disk()</td></tr>
<tr><td>CSV</td><td>Dataset.to_csv()</td></tr>
<tr><td>JSON</td><td>Dataset.to_json()</td></tr>
</table>

Saving in Arrow format:

In [57]:
clean_drug_dataset.save_to_disk("drug_reviews")

Saving the dataset (0/1 shards):   0%|          | 0/110811 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/27703 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/46108 [00:00<?, ? examples/s]

Reloading dataset

In [61]:
from datasets import load_from_disk
reloaded_dataset = load_from_disk('drug_reviews')
reloaded_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_len'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_len'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_len'],
        num_rows: 46108
    })
})

To store the dataset as JSON, we need to create a separate file for each split.

In [62]:
for split, data in clean_drug_dataset.items():
  data.to_json(f'drug_reviews_{split}.jsonl')

Creating json from Arrow format:   0%|          | 0/111 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/28 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/47 [00:00<?, ?ba/s]
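
The `.jsonl` files use the JSON Lines layout: one JSON object per line. The sketch below writes and reads a tiny file in that layout with the standard library only, to show what the files contain. The file name `drug_reviews_demo.jsonl` and the trimmed rows are illustrative assumptions. With the datasets library, the per-split files can be loaded back through `load_dataset("json", data_files=...)`.

```python
import json
import os
import tempfile

rows = [
    {"drugName": "Guanfacine", "condition": "adhd", "rating": 8.0},
    {"drugName": "Lybrel", "condition": "birth control", "rating": 5.0},
]

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "drug_reviews_demo.jsonl")

# Write one JSON object per line, which is the JSON Lines layout.
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read the file back line by line.
with open(path, encoding="utf-8") as f:
    reloaded = [json.loads(line) for line in f]

print(reloaded == rows)  # True
```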