## Drug Review Dataset

!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"  
!unzip drugsCom_raw.zip -d ./data  
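TSV is simply a CSV variant that uses tabs instead of commas as the separator, so we can load these files with the csv loader of load_dataset() and pass the delimiter explicitly.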

In [2]:
from datasets import load_dataset

data_files = {"train": "./data/drugsComTrain_raw.tsv", "test": "./data/drugsComTest_raw.tsv"}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")
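
As a quick illustration of what the delimiter argument controls, the sketch below parses a small tab-separated sample with Python's built-in csv module. The two sample rows are made up for illustration and are not taken from the dataset file.

```python
import csv
import io

# Hypothetical two-row sample in TSV form (columns separated by tabs).
sample_tsv = "drugName\tcondition\trating\nAleve\tGout, Acute\t9.0\nMobic\tPain\t8.0\n"

# With delimiter="\t", commas inside a field stay part of the field.
rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))
print(rows[0]["condition"])
```

Parsing the same text with the default comma delimiter would instead split the condition "Gout, Acute" in the middle, which is why the delimiter has to be stated.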

In [2]:
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))  # Fix the random seed and draw a random sample of 1,000 examples
# Peek at the first few examples
drug_sample[:3]

{'Unnamed: 0': [87571, 178045, 80482],
 'drugName': ['Naproxen', 'Duloxetine', 'Mobic'],
 'condition': ['Gout, Acute', 'ibromyalgia', 'Inflammatory Conditions'],
 'review': ['"like the previous person mention, I&#039;m a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
  '"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."',
  '"I have been taking Mobic for over a year with no side effects other than 

The fields in the sample above have the following meanings:
- Unnamed: 0: an anonymized ID for each patient
- condition: the health-condition label

Verify that the number of unique anonymized IDs matches the number of rows in each split.

In [3]:
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(
        drug_dataset[split].unique("Unnamed: 0"))
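
The assertion relies on a simple property: a column holds only distinct values exactly when the number of unique values equals the number of rows. A plain-Python sketch of the same check, using the three IDs from the sample above plus a made-up list that contains a duplicate:

```python
def all_unique(values):
    """Return True when no value appears more than once."""
    return len(set(values)) == len(values)


print(all_unique([87571, 178045, 80482]))  # IDs from the sample above
print(all_unique([1, 2, 2]))               # made-up list with a duplicate
```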

Rename the Unnamed: 0 column to patient_id.

In [4]:
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

Normalize all condition labels with Dataset.map().

In [11]:
# Drop rows whose condition is None
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)

In [14]:
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}


drug_dataset = drug_dataset.map(lowercase_condition)

In [15]:
drug_dataset["train"]["condition"][:3]

['left ventricular dysfunction', 'adhd', 'birth control']
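
The None filter has to run before the lowercasing step, because calling .lower() on None raises an AttributeError. The sketch below reproduces this on a made-up list of dictionaries rather than on the real dataset:

```python
# Made-up rows; the second one mimics a missing condition label.
rows = [{"condition": "ADHD"}, {"condition": None}, {"condition": "Birth Control"}]

try:
    [row["condition"].lower() for row in rows]
    failed = False
except AttributeError:
    # None has no .lower() method, so lowercasing unfiltered rows fails.
    failed = True

# Filtering first makes the lowercasing step safe.
clean = [row for row in rows if row["condition"] is not None]
lowered = [row["condition"].lower() for row in clean]
print(failed, lowered)
```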

## Creating New Columns

### Counting the Words in Each Review

In [16]:
def compute_review_length(example):
    return {"review_length": len(example["review"].split())}

In [17]:
drug_dataset = drug_dataset.map(compute_review_length)
# Inspect the first training example
drug_dataset["train"][0]

{'patient_id': 206461,
 'drugName': 'Valsartan',
 'condition': 'left ventricular dysfunction',
 'review': '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
 'rating': 9.0,
 'date': 'May 20, 2012',
 'usefulCount': 27,
 'review_length': 17}
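
str.split() with no arguments splits on any run of whitespace, so punctuation stays attached to the neighbouring word and numbers count as words. Applying the same function by hand to the review shown above gives the stored length:

```python
def compute_review_length(example):
    # Whitespace-separated tokens; punctuation stays attached to words.
    return {"review_length": len(example["review"].split())}


example = {
    "review": '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"'
}
print(compute_review_length(example))
```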

### Inspecting the Extreme Reviews

In [18]:
drug_dataset["train"].sort("review_length")[:3]

{'patient_id': [111469, 13653, 53602],
 'drugName': ['Ledipasvir / sofosbuvir',
  'Amphetamine / dextroamphetamine',
  'Alesse'],
 'condition': ['hepatitis c', 'adhd', 'birth control'],
 'review': ['"Headache"', '"Great"', '"Awesome"'],
 'rating': [10.0, 10.0, 10.0],
 'date': ['February 3, 2015', 'October 20, 2009', 'November 23, 2015'],
 'usefulCount': [41, 3, 0],
 'review_length': [1, 1, 1]}

### Removing Reviews of 30 Words or Fewer

In [19]:
drug_dataset = drug_dataset.filter(lambda x: x["review_length"] > 30)
print(drug_dataset.num_rows)

{'train': 138514, 'test': 46108}
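
Because the predicate is a strict comparison, a review with exactly 30 words is dropped and one with 31 words is kept, so the shortest reviews that remain have 31 words. A small check on made-up reviews:

```python
def keep(example):
    # Same predicate as the filter: strictly more than 30 words.
    return example["review_length"] > 30


# Made-up reviews of exactly 30 and 31 words.
reviews = [" ".join(["word"] * n) for n in (30, 31)]
examples = [{"review_length": len(review.split())} for review in reviews]
print([keep(example) for example in examples])
```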


In [20]:
drug_dataset["train"].sort("review_length")[:3]

{'patient_id': [208641, 118552, 2448],
 'drugName': ['Amlodipine / olmesartan',
  'Amoxicillin / clarithromycin / lansoprazole',
  'Emend'],
 'condition': ['high blood pressure',
  'helicobacter pylori infection',
  'nausea/vomiting, postoperative'],
 'review': ['"My blood pressure has been around 160/100. Doctor prescribed Azor 40/10. Just 4 hrs later my reading showed 120/82. I was amazed. I am now on it daily. Thanks to Azor."',
  '"I had severe vomiting and diarhoea for 3 days caused by clarythromycin. After being treated for dehydration at the hospital, clarythormycin was replaced with doxycycline, and I have no problems since."',
  '"I always get nausea and vomiting with anesthesia even when taking other anti-nausea meds. Was given Emend prior to gallbladder removal. Woke up with absolutely no nausea. Worked great for me!"'],
 'rating': [10.0, 2.0, 10.0],
 'date': ['January 19, 2015', 'February 18, 2017', 'August 3, 2016'],
 'usefulCount': [10, 4, 1],
 'review_length': [31, 31, 3

In [21]:
import html

text = "I&#039;m a transformer called BERT"
html.unescape(text)

"I'm a transformer called BERT"

In [28]:
%time drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])})

CPU times: user 6.74 s, sys: 156 ms, total: 6.89 s
Wall time: 6.97 s
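
html.unescape() handles numeric and named character references alike and leaves text without entities unchanged. A short check using a fragment of the first sample review; the second and third strings are made up:

```python
import html

samples = [
    "I&#039;m a strong believer of aleve",  # fragment of a review shown earlier
    "pain &amp; swelling &lt; 2 days",      # made-up named entities
    "no entities here",                     # made-up plain text
]
cleaned = [html.unescape(sample) for sample in samples]
print(cleaned)
```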


In [29]:
# Speed things up with batched=True; the default batch size is 1000
%time new_drug_dataset = drug_dataset.map(lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True)

CPU times: user 206 ms, sys: 59.7 ms, total: 266 ms
Wall time: 349 ms
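
With batched=True the mapped function no longer receives a single example: it receives a dictionary whose values are lists, one entry per example in the batch, and it must return lists as well. The sketch below calls such a function by hand on a made-up batch of two reviews, without going through Dataset.map():

```python
import html


def unescape_batch(batch):
    # batch["review"] is a list of strings, so unescape each element.
    return {"review": [html.unescape(text) for text in batch["review"]]}


# Made-up batch in the dict-of-lists layout used for batched mapping.
batch = {"review": ["I&#039;m fine", "It &quot;works&quot;"], "rating": [9.0, 8.0]}
print(unescape_batch(batch))
```

Handling many examples per function call instead of one is a plausible explanation for most of the difference between the two timings above.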


In [35]:
%time new_drug_dataset = drug_dataset.map(lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True, num_proc=2)

CPU times: user 33.9 ms, sys: 44.2 ms, total: 78.1 ms
Wall time: 411 ms


In [36]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/Volumes/WD_BLACK/models/bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["review"], truncation=True)

In [37]:
def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,  # Return the tokens beyond max_length as additional chunks instead of discarding them
    )

In [38]:
result = tokenize_and_split(drug_dataset["train"][0])
[len(inp) for inp in result["input_ids"]]

[128, 49]
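
The two lengths can be reproduced with a simplified model of the splitting. The sketch assumes no overlap between chunks and two special tokens per chunk, as a BERT-style tokenizer adds for a single sequence ([CLS] and [SEP]). Under those assumptions the review has 173 content tokens, a figure inferred from the output rather than measured. The code uses integers in place of real token ids and is not the tokenizer's actual implementation.

```python
def split_into_chunks(token_ids, max_length=128, num_special=2):
    """Split token ids into chunks of at most max_length, counting special tokens."""
    capacity = max_length - num_special  # room left for content tokens
    chunks = []
    for start in range(0, len(token_ids), capacity):
        body = token_ids[start:start + capacity]
        # -1 and -2 are placeholders standing in for [CLS] and [SEP].
        chunks.append([-1] + body + [-2])
    return chunks


content = list(range(173))  # assumed number of content tokens
lengths = [len(chunk) for chunk in split_into_chunks(content)]
print(lengths)
```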

In [40]:
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 138514
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

In [41]:
tokenized_dataset = drug_dataset.map(
    tokenize_and_split, batched=True, remove_columns=drug_dataset["train"].column_names
)

In [42]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'],
        num_rows: 206772
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'],
        num_rows: 68876
    })
})

In [43]:
len(tokenized_dataset["train"]), len(drug_dataset["train"])

(206772, 138514)

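### Keeping the Original Columns When One Example Splits into Several

With return_overflowing_tokens=True, a long review is split into several chunks, so the tokenizer returns more rows than the batch it received. If the original columns were kept unchanged, their length would no longer match the length of the new columns and Dataset.map() would raise an error. The approach above avoids this by dropping the original columns with remove_columns. The alternative below keeps them: it uses overflow_to_sample_mapping to repeat each original value once for every chunk produced from that example, so all columns end up with the same length.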

In [51]:
def tokenize_and_split(examples):
    result = tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )
    # Map each output chunk back to the index of the example it came from
    sample_map = result.pop("overflow_to_sample_mapping")
    # Repeat the original column values once per chunk so all columns have the same length
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result

In [None]:
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
tokenized_dataset
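
To make the index mapping in tokenize_and_split concrete, the sketch below applies the same list comprehension to a made-up batch in which the first of two examples was split into two chunks, so overflow_to_sample_mapping would be [0, 0, 1]. The column values are invented for illustration.

```python
# Made-up batch of two examples (dict of lists, as in batched mapping).
examples = {"patient_id": [10, 20], "rating": [9.0, 2.0]}

# Chunk i of the tokenizer output came from example sample_map[i].
sample_map = [0, 0, 1]

# Repeat each original value once per chunk derived from its example.
expanded = {key: [values[i] for i in sample_map] for key, values in examples.items()}
print(expanded)
```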

## Converting Between Datasets and DataFrames

In [3]:
drug_dataset.set_format("pandas")

In [4]:
drug_dataset["train"][:3]

|   | Unnamed: 0 | drugName | condition | review | rating | date | usefulCount |
|---|---|---|---|---|---|---|---|
| 0 | 206461 | Valsartan | Left Ventricular Dysfunction | "It has no side effect, I take it in combinati... | 9.0 | May 20, 2012 | 27 |
| 1 | 95260 | Guanfacine | ADHD | "My son is halfway through his fourth week of ... | 8.0 | April 27, 2010 | 192 |
| 2 | 92703 | Lybrel | Birth Control | "I used to take another oral contraceptive, wh... | 5.0 | December 14, 2009 | 17 |


In [5]:
train_df = drug_dataset["train"][:]

In [6]:
frequencies = (
    train_df["condition"]
    .value_counts()
    # Name the index and the counts explicitly so the column names
    # do not depend on the installed pandas version
    .rename_axis("condition")
    .reset_index(name="frequency")
)
frequencies.head()

|   | condition | frequency |
|---|---|---|
| 0 | Birth Control | 28788 |
| 1 | Depression | 9069 |
| 2 | Pain | 6145 |
| 3 | Anxiety | 5904 |
| 4 | Acne | 5588 |


In [7]:
from datasets import Dataset

freq_dataset = Dataset.from_pandas(frequencies)
freq_dataset

Dataset({
    features: ['condition', 'frequency'],
    num_rows: 884
})

In [8]:
drug_dataset.reset_format()

In [9]:
drug_dataset_clean = drug_dataset["train"].train_test_split(
    train_size=0.8, seed=42)
# Rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")
# Add the "test" set to our `DatasetDict`
drug_dataset_clean["test"] = drug_dataset["test"]
drug_dataset_clean

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 129037
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 32260
    })
    test: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})
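
The split sizes follow from the row count: 80% of 161,297 is 129,037.6, and the output shows the training part rounded down with the remainder going to the held-out part. The arithmetic below reproduces those numbers. The rounding rule is inferred from the output shown here, not taken from the library's documentation.

```python
import math

n_rows = 161297   # rows in the original training split
train_size = 0.8

n_train = math.floor(train_size * n_rows)  # fractional size rounded down
n_validation = n_rows - n_train            # the rest becomes the held-out part
print(n_train, n_validation)
```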

In [10]:
drug_dataset_clean.save_to_disk("./data/drug-reviews")

In [11]:
from datasets import load_from_disk

drug_dataset_reloaded = load_from_disk("./data/drug-reviews")
drug_dataset_reloaded

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 129037
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 32260
    })
    test: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

In [14]:
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"./data/drug-reviews-{split}.jsonl")
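
In the JSON Lines format each row is stored as one JSON object on its own line, so a file can be read back one line at a time. A standard-library sketch of the round trip on two rows, the first reduced to two fields of a training example shown earlier and the second made up:

```python
import io
import json

rows = [
    {"drugName": "Valsartan", "rating": 9.0},
    {"drugName": "Example drug", "rating": 5.0},  # made-up row
]

# Write: one JSON object per line.
buffer = io.StringIO()
for row in rows:
    buffer.write(json.dumps(row) + "\n")

# Read back: parse each non-empty line independently.
reloaded = [json.loads(line) for line in buffer.getvalue().splitlines() if line]
print(reloaded == rows)
```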

In [15]:
data_files = {
    "train": "./data/drug-reviews-train.jsonl",
    "validation": "./data/drug-reviews-validation.jsonl",
    "test": "./data/drug-reviews-test.jsonl",
}
drug_dataset_reloaded = load_dataset("json", data_files=data_files)
