In this notebook, we prepare three datasets: Wikipedia, 20 newsgroups, and Web of Science (WoS).

The Wikipedia dataset was created in three steps: (1) define the classes (sport, music, etc.); (2) search for the Wikipedia categories associated with each class; (3) for each category, download its articles through the Wikimedia API.

The [20 newsgroups](http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz) dataset and [WoS](https://data.mendeley.com/public-files/datasets/9rw3vkcfy4/files/c9ea673d-5542-44c0-ab7b-f1311f7d61df/file_downloaded) are available to download.

After downloading, we split each dataset into three parts: train, val, and test. We use a SentencePiece tokenizer to preprocess the text and save it to files for training models.

The Wikipedia dataset has 15 classes and 29924 documents (train 20000, val 4924, test 5000). The 20 newsgroups dataset is already split into train and test; we further split the train part to form the validation part. It has 20 classes and 18846 documents (train 8000, val 3314, test 7532). The Web of Science dataset comes in three subsets: small, medium, and large. Here we use only the medium one, which has 35 classes and 11967 documents (train 7197, val 2355, test 2415).

# 1. Wikipedia dataset

The Wikipedia dataset was created by the following steps:

1. Manually choose the classes (labels) of the dataset. They should be general categories and should not overlap. For instance, the classes can be music, film, sport, etc. The class "artist" should not be included, as it overlaps with other classes (e.g. an artist can be a musician, a filmmaker, or a dancer). For example, we choose "music" as a class.

2. Manually search for Wikipedia categories related to the chosen class. For example, we search all Wikipedia categories for those containing the string "music". We may end up with around 50 categories, including **Music**al instruments, 21st-century **music**ians, or Pop **music** groups. We want to choose categories that are representative and contain a large number of articles. To do this, we use an API to request the articles belonging to these categories, count the articles in each category, and sort the categories by that count. Then we manually choose around 10 categories that we think are appropriate and representative, and write them to a text file named after the class, e.g. music.txt.

3. Use an API to request all articles in the chosen categories. We discard articles shorter than 500 characters and keep roughly the first 2000 characters of each remaining article (`max_char` in the code below). A small trick keeps the last word whole: we cut at the first space after the character limit rather than exactly at it. We also store the metadata of each downloaded article, such as its ID and title.

4. Repeat for the other classes, such as sport, film, science, politics, etc.

5. After downloading all of the classes, we merge all articles into one table, shuffle it, and split it into three parts: train, val, and test. We preprocess the text with the SentencePiece tokenizer.
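
The length rule in step 3 can be sketched in isolation. This is a minimal illustration, assuming the cut is made at the first space at or after the character limit, as the download cell later in this notebook does. The function names `truncate_at_word` and `keep_article` and the toy limits are hypothetical and are not part of the notebook.

```python
def truncate_at_word(text, max_char):
    """Cut `text` at the first space at or after position `max_char`.

    If no such space exists, str.find returns -1 and the cut falls back
    to exactly `max_char` characters.
    """
    ix = max(text.find(' ', max_char), max_char)
    return text[:ix]


def keep_article(text, min_char):
    """Return True if the (already truncated) text is long enough to keep."""
    return len(text) >= min_char


sample = "Honda NR500 was a racing motorcycle"
short = truncate_at_word(sample, 8)   # position 8 falls inside "NR500"
print(short)                          # the whole word "NR500" is kept
print(keep_article(short, 5))
```

Because the cut moves forward to the next space, a kept article can be slightly longer than the limit.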

In [108]:
import glob
import requests
import re
import pandas as pd
import numpy as np
import sentencepiece as spm

In [110]:
# read all categories in English Wikipedia
with open('./wiki_categories.txt') as f:
    all_categories = f.read().splitlines()

This function collects the article IDs in each category of a list, so that we can count how many articles each chosen category contains:

In [109]:
def get_ID_pages(selected_categories):
    S = requests.Session()
    URL = "https://en.wikipedia.org/w/api.php"

    category_IDs = {}
    for cat in selected_categories:
        IDs = []
        PARAMS = {
                "action": "query",
                "list": "categorymembers",
                "format": "json",
                "cmtitle": "Category:"+cat,
                "cmlimit": "500"
            }

        for i in range(10):    # at most 10 requests of up to 500 members each; increase to get more data
            R = S.get(url=URL, params=PARAMS)
            DATA = R.json()

            PAGES = DATA["query"]["categorymembers"]

            for page in PAGES:
                IDs.append(page["pageid"])

            if "continue" in DATA:
                PARAMS["cmcontinue"] = DATA["continue"]["cmcontinue"]
            else:
                break

        category_IDs[cat] = IDs
    return category_IDs
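
The continuation logic in this function can be tried offline. The sketch below is only an illustration: `collect_members` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real request by returning two hand-made pages of results linked by a `cmcontinue` token.

```python
def collect_members(fetch, params, max_requests=10):
    """Follow `cmcontinue` tokens until the listing ends or max_requests is reached."""
    params = dict(params)          # copy, so the caller's dict is not modified
    ids = []
    for _ in range(max_requests):
        data = fetch(params)
        ids.extend(page["pageid"] for page in data["query"]["categorymembers"])
        if "continue" not in data:
            break                  # no token means this was the last page
        params["cmcontinue"] = data["continue"]["cmcontinue"]
    return ids


def fake_fetch(params):
    """Stand-in for S.get(...).json(): two pages of results, linked by a token."""
    if params.get("cmcontinue") == "page2":
        return {"query": {"categorymembers": [{"pageid": 3}]}}
    return {
        "query": {"categorymembers": [{"pageid": 1}, {"pageid": 2}]},
        "continue": {"cmcontinue": "page2"},
    }


print(collect_members(fake_fetch, {"cmlimit": "500"}))
```

With `max_requests=1` only the first page is collected, which is why the per-category counts in this notebook are capped by the number of requests times `cmlimit`.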

Search for all categories containing a related string:

In [111]:
search_categories = []
for cat in all_categories:
    cat_lower = cat.lower()
    if ('motor' in cat_lower  # related string
            # and discard all internal Wiki categories:
            and 'wikiproject' not in cat_lower and 'wikipedia' not in cat_lower
            and 'articles' not in cat_lower
            and '-importance' not in cat_lower and '-class' not in cat_lower):
        search_categories.append(cat)
print(len(search_categories))
search_categories

19


['American motorcycle racers',
 'Deaths from motor neuron disease',
 'Defunct motor vehicle manufacturers of England',
 'Defunct motor vehicle manufacturers of France',
 'Defunct motor vehicle manufacturers of the United States',
 'English motorcycle racers',
 'Honda motorcycles',
 'Italian motorcycle racers',
 'Jeonbuk Hyundai Motors FC players',
 'Luxury motor vehicle manufacturers',
 'Motor vehicle company stubs',
 'Motor vehicle manufacturers based in Michigan',
 'Motor vehicles manufactured in the United States',
 'Motorcycle stubs',
 'Motorsport announcers',
 'Motorsport by year',
 'Motorsport venue stubs',
 'Standard motorcycles',
 'Yamaha motorcycles']
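
The same filter can be written as a small reusable predicate, which makes it easier to repeat the search for other classes. This is a sketch only; `matches_topic`, `INTERNAL_MARKERS`, and the sample category names are hypothetical.

```python
# Substrings that mark internal maintenance categories rather than content
INTERNAL_MARKERS = ('wikiproject', 'wikipedia', 'articles', '-importance', '-class')


def matches_topic(category, keyword, excluded=INTERNAL_MARKERS):
    """True if the category name contains `keyword` and none of the excluded markers."""
    name = category.lower()
    return keyword in name and not any(marker in name for marker in excluded)


sample = [
    'Honda motorcycles',
    'WikiProject Motorcycling articles',
    'Pop music groups',
    'Stub-Class Motorsport articles',
]
print([c for c in sample if matches_topic(c, 'motor')])
```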

In [112]:
# search for all articles in the list of categories above
# store the number of articles in each category
category_IDs = get_ID_pages(search_categories)

In [113]:
# print the categories and the number of articles in descending order

stats = {}
all_IDs = []
for cat, IDs in category_IDs.items():
    stats[cat] = len(IDs)
    all_IDs += IDs

print(len(set(all_IDs)))
{k: v for k, v in sorted(stats.items(), key=lambda item: item[1], reverse=True)}

5818


{'Defunct motor vehicle manufacturers of the United States': 781,
 'Motorcycle stubs': 651,
 'English motorcycle racers': 436,
 'Defunct motor vehicle manufacturers of England': 431,
 'Motor vehicles manufactured in the United States': 410,
 'Deaths from motor neuron disease': 379,
 'Motor vehicle company stubs': 377,
 'Defunct motor vehicle manufacturers of France': 347,
 'Honda motorcycles': 306,
 'Luxury motor vehicle manufacturers': 280,
 'Standard motorcycles': 269,
 'Motorsport by year': 257,
 'American motorcycle racers': 256,
 'Motorsport venue stubs': 253,
 'Motor vehicle manufacturers based in Michigan': 245,
 'Italian motorcycle racers': 236,
 'Jeonbuk Hyundai Motors FC players': 231,
 'Motorsport announcers': 231,
 'Yamaha motorcycles': 201}

In [6]:
selected_categories = [
'Superbike World Championship riders',
'Motorcycle stubs',
'Standard motorcycles',
'Luxury motor vehicle manufacturers',
'Yamaha motorcycles',
'Honda motorcycles',
'Italian motorcycle racers',
'125cc World Championship riders',
'Motorsport venue stubs',
'250cc World Championship riders',
'American motorcycle racers',
'English motorcycle racers']

# save chosen categories to a text file
with open('./wiki_dataset_raw/category/vehicle.motorcycles.txt', 'w') as f:
    for cat in selected_categories:
        f.write("%s\n" % cat)

This function requests all articles in a class and stores their IDs and titles. Later, with the ID or the title of an article, we can request its page content.

In [115]:
def get_ID_title(category_label):
    S = requests.Session()
    URL = "https://en.wikipedia.org/w/api.php"

    label_ID_title = []
    for cat, label in category_label.items():
        PARAMS = {
                "action": "query",
                "list": "categorymembers",
                "format": "json",
                "cmtitle": "Category:"+cat,
                "cmlimit": "500"
            }

        for i in range(10):    # at most 10 requests of up to 500 members each; increase to get more data
            R = S.get(url=URL, params=PARAMS)
            DATA = R.json()

            PAGES = DATA["query"]["categorymembers"]

            for page in PAGES:
                ID = page["pageid"]
                title = page["title"]
                # skip non-article pages (subcategories, files, templates, talk pages)
                if not title.startswith(("Category:", "File:", "Template:", "Talk:")):
                    label_ID_title.append({'cat': cat,
                                          'label': label,
                                          'ID': ID,
                                          'title': title,
                                          'mark': 0})

            if "continue" in DATA:
                PARAMS["cmcontinue"] = DATA["continue"]["cmcontinue"]
            else:
                break

    return label_ID_title
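
As an alternative to filtering titles by prefix, the MediaWiki `categorymembers` list accepts a `cmnamespace` parameter, and namespace 0 is the main article namespace. The sketch below only builds the request parameters; `build_params` is a hypothetical helper, and this is not what the notebook ran. Restricting to namespace 0 would also drop pages from namespaces the prefix check lets through, so the resulting article list may differ.

```python
def build_params(category, namespace="0", limit="500"):
    """Request parameters for listing the members of one category."""
    return {
        "action": "query",
        "list": "categorymembers",
        "format": "json",
        "cmtitle": "Category:" + category,
        "cmlimit": limit,
        "cmnamespace": namespace,   # "0" = articles only (assumption: not used in the original run)
    }


params = build_params("Honda motorcycles")
print(params["cmtitle"], params["cmnamespace"])
```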

In [11]:
extract_label = 'vehicle.motorcycles'
with open(f'./wiki_dataset_raw/category/{extract_label}.txt') as f:
    selected_categories = f.read().splitlines()
    
category_label = {}
for cat in selected_categories:
    category_label[cat] = extract_label
len(category_label)

12

In [117]:
df_ID_title = get_ID_title(category_label)

In [118]:
df_ID_title = pd.DataFrame(df_ID_title)
df_ID_title = df_ID_title.drop_duplicates(subset=['ID'])

In [119]:
df_ID_title

| | cat | label | ID | title | mark |
| --- | --- | --- | --- | --- | --- |
| 0 | Superbike World Championship riders | vehicle.motorcycles | 13196576 | List of Superbike World champions | 0 |
| 1 | Superbike World Championship riders | vehicle.motorcycles | 13196590 | List of Superbike World Championship race winners | 0 |
| 2 | Superbike World Championship riders | vehicle.motorcycles | 13199257 | List of Superbike World Championship racers | 0 |
| 3 | Superbike World Championship riders | vehicle.motorcycles | 23765474 | Superbike World Championship records | 0 |
| 4 | Superbike World Championship riders | vehicle.motorcycles | 2857412 | Norifumi Abe | 0 |
| ... | ... | ... | ... | ... | ... |
| 4026 | English motorcycle racers | vehicle.motorcycles | 44857634 | Steve Worrall | 0 |
| 4027 | English motorcycle racers | vehicle.motorcycles | 12334034 | Charles Wright (speedway rider) | 0 |
| 4028 | English motorcycle racers | vehicle.motorcycles | 10428173 | James Wright (speedway rider) | 0 |
| 4029 | English motorcycle racers | vehicle.motorcycles | 19461142 | Doug Wyer | 0 |


In [121]:
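min_article = 2000   # stop after this many articles have been kept for the class
random_seed = 111    # seed for the reproducible shuffle
min_char = 500       # discard articles shorter than this many characters
max_char = 2021      # cut each article at the first space at or after this position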

In [122]:
# shuffle to mix categories
df_dataset = df_ID_title.sample(frac=1, random_state=random_seed)
df_dataset

| | cat | label | ID | title | mark |
| --- | --- | --- | --- | --- | --- |
| 348 | Motorcycle stubs | vehicle.motorcycles | 6548157 | BMW K1200GT | 0 |
| 142 | Superbike World Championship riders | vehicle.motorcycles | 35482216 | Vladimir Leonov (motorcyclist) | 0 |
| 994 | Standard motorcycles | vehicle.motorcycles | 42716596 | Honda CB Trigger | 0 |
| 1820 | Honda motorcycles | vehicle.motorcycles | 8024793 | Honda NR500 | 0 |
| 987 | Standard motorcycles | vehicle.motorcycles | 41222030 | Hero Karizma | 0 |
| ... | ... | ... | ... | ... | ... |
| 681 | Motorcycle stubs | vehicle.motorcycles | 67987385 | Norton Model 77 Dominator | 0 |
| 86 | Superbike World Championship riders | vehicle.motorcycles | 23412525 | Jonas Folger | 0 |
| 724 | Motorcycle stubs | vehicle.motorcycles | 1734184 | Sissy bar | 0 |
| 2462 | 125cc World Championship riders | vehicle.motorcycles | 35499359 | Iori Namihira | 0 |


Request the contents of articles:

In [126]:
S = requests.Session()
URL = "https://en.wikipedia.org/w/api.php"

content_file = open(f"./wiki_dataset_raw/dataset/{extract_label}/data.txt", 'w')
num_article = 0

for index, row in df_dataset.iterrows():
    PARAMS = {
        "action": "query",
        "prop": "extracts",
        "format": "json",
#         "exintro": True,
        "explaintext": True,
        "redirects": True,
        "titles": row['title']
    }

    R = S.get(url=URL, params=PARAMS)
    DATA = R.json()
    PAGES = DATA["query"]["pages"]
    
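    # the response maps page IDs to pages; we requested a single title
    page = next(iter(PAGES.values()))
    # missing pages have no "extract" key; treat them as empty so they are skipped below
    extract = page.get("extract", "").strip()
    # replace section headings such as "== History ==" with a space
    extract = re.sub('\n*.+==\n*', ' ', extract)
    # cut at the first space at or after max_char so the last word is kept whole
    ix = max(extract.find(' ', max_char), max_char)
    extract = extract[:ix]
    # skip articles that are too short or that still contain raw infobox markup
    if len(extract) < min_char or '{{Infobox' in extract:
        continue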
    content_file.write("<doc id=\""+str(int(row['ID']))+"\" title=\""
                       +row['title']+"\" class=\""+row['label']+"\">\n")
    content_file.write(extract+'\n')
    content_file.write("</doc>\n\n")

    df_dataset.at[index,'mark'] = 1
    num_article += 1
    if num_article >= min_article:
        break
        
content_file.close()
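
To see what the heading-removal regex in this cell does, it can be run on a small hand-made extract. The sketch below assumes the plain-text extract marks section titles as `== Title ==` lines; `strip_headings` and the sample text are hypothetical.

```python
import re


def strip_headings(extract):
    """Replace section heading lines (text ending in '==') with a single space."""
    return re.sub('\n*.+==\n*', ' ', extract)


raw = "Intro text.\n\n\n== History ==\nMore text."
print(strip_headings(raw))
```

The pattern also consumes the newlines around a heading, so the paragraphs before and after it end up joined by one space. Note that any body line containing `==` would be removed up to that point as well, since the pattern does not check that the line is a heading.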

In [127]:
df_tmp = df_dataset[df_dataset['mark'] == 1].sort_values('cat')
df_tmp

| | cat | label | ID | title | mark |
| --- | --- | --- | --- | --- | --- |
| 2240 | 125cc World Championship riders | vehicle.motorcycles | 8303738 | Kel Carruthers | 1 |
| 2391 | 125cc World Championship riders | vehicle.motorcycles | 30139598 | S. Sarath Kumar | 1 |
| 2343 | 125cc World Championship riders | vehicle.motorcycles | 14849493 | Manuel Herreros | 1 |
| 2552 | 125cc World Championship riders | vehicle.motorcycles | 35471964 | Bryan Schouten | 1 |
| 2228 | 125cc World Championship riders | vehicle.motorcycles | 6138415 | Ralph Bryans | 1 |
| ... | ... | ... | ... | ... | ... |
| 1485 | Yamaha motorcycles | vehicle.motorcycles | 49889357 | Yamaha Tracer 900 | 1 |
| 1463 | Yamaha motorcycles | vehicle.motorcycles | 10997643 | Yamaha FZR600 | 1 |
| 1548 | Yamaha motorcycles | vehicle.motorcycles | 12558397 | Yamaha Venture Royale | 1 |
| 1602 | Yamaha motorcycles | vehicle.motorcycles | 62448411 | Yamaha XJ1100 | 1 |


In [128]:
cats = df_tmp['cat'].unique()
for cat in cats:
    print(cat, len(df_tmp[df_tmp['cat'] == cat]))

125cc World Championship riders 118
250cc World Championship riders 83
American motorcycle racers 171
English motorcycle racers 269
Honda motorcycles 118
Italian motorcycle racers 65
Luxury motor vehicle manufacturers 205
Motorcycle stubs 416
Motorsport venue stubs 148
Standard motorcycles 146
Superbike World Championship riders 185
Yamaha motorcycles 76


In [129]:
# write meta data of articles
df_tmp.to_csv(f'./wiki_dataset_raw/dataset/{extract_label}/meta.csv', mode='w', sep=',',
              columns=['ID','title','cat','label'], index=False)

Tokenize and split the dataset:

In [10]:
sp = spm.SentencePieceProcessor()
sp.load('../spmcc.model')

True

In [51]:
meta_files = glob.glob('./wiki_dataset_raw/dataset/*/meta.csv')
df_dataset = []
for file in meta_files:
    df_dataset.append(pd.read_csv(file))
df_dataset = pd.concat(df_dataset)
df_dataset = df_dataset.drop_duplicates(subset=['ID'])
# renumber the rows 0..n-1 (each concatenated file starts at 0) and keep ID as the first column
df_dataset = df_dataset.set_index('ID')
df_dataset.reset_index(inplace=True)

df_dataset['doc'] = ''
df_dataset

| | ID | title | cat | label | doc |
| --- | --- | --- | --- | --- | --- |
| 0 | 7288793 | George Stovey | African-American baseball players | sport.baseball | |
| 1 | 509850 | Bill White (first baseman) | African-American baseball players | sport.baseball | |
| 2 | 11339464 | Fenwick Watkins | African-American baseball players | sport.baseball | |
| 3 | 5902083 | Bernard Gilkey | African-American baseball players | sport.baseball | |
| 4 | 5184591 | Bubba Morton | African-American baseball players | sport.baseball | |
| ... | ... | ... | ... | ... | ... |
| 29919 | 9814202 | WMUZ (AM) | Religious radio stations in the United States | religion | |
| 29920 | 16034351 | WYGG | Religious radio stations in the United States | religion | |
| 29921 | 15385496 | WBCI | Religious radio stations in the United States | religion | |
| 29922 | 17909663 | WRRE | Religious radio stations in the United States | religion | |


In [52]:
data_files = glob.glob('./wiki_dataset_raw/dataset/*/data.txt')
doc = ""
for file in data_files:
    with open(file) as f:
        for l in f:        
            l = l.rstrip('\n')
            if l[:4] == "<doc":
                m = re.search(".*id=([^ ]*) ",l)
                ID = m.group(1)
                ID = int(ID.strip('"'))
            elif l[:5] == "</doc":
                # tokenize the accumulated text and join the pieces with spaces
                doc = ' '.join(sp.encode_as_pieces(doc))
                
                mask = df_dataset['ID'] == ID
                pos = np.flatnonzero(mask)[0]
                df_dataset.at[pos, 'doc'] = doc
                
                doc = ""
            else:
                doc+=l+' '
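
The cell above scans the whole `ID` column once per document. A possible alternative, sketched here under assumptions, is to first collect the texts into a dictionary keyed by ID and then assign them in one step (for example with pandas `Series.map`). The sketch covers only the parsing half and skips tokenization; `parse_docs` and the sample document are hypothetical. Unlike the original loop, it drops blank lines and the trailing space.

```python
import io
import re


def parse_docs(lines):
    """Collect {doc ID: raw text} from lines in the <doc ...> ... </doc> layout."""
    docs = {}
    doc_id, parts = None, []
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith("<doc"):
            # the ID is the first attribute and may or may not be quoted
            doc_id = int(re.search(r'id="?(\d+)"?', line).group(1))
            parts = []
        elif line.startswith("</doc"):
            docs[doc_id] = ' '.join(parts)
        elif line:  # skip the blank separator lines between documents
            parts.append(line)
    return docs


sample = io.StringIO(
    '<doc id="8024793" title="Honda NR500" class="vehicle.motorcycles">\n'
    'The Honda NR500 was a racing motorcycle.\n'
    '</doc>\n\n'
)
print(parse_docs(sample))
```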

In [58]:
random_seed = 111
df_dataset = df_dataset.sample(frac=1, random_state=random_seed)
df_dataset

| | ID | title | cat | label | doc |
| --- | --- | --- | --- | --- | --- |
| 22102 | 39576637 | Willer Bordon | 21st-century Italian politicians | politic | ▁Will er ▁Bor don ▁(16 ▁January ▁19 49 ▁– ▁14 ... |
| 3132 | 11392088 | J. Evan Bonifant | American male film actors | films | ▁J . ▁E van ▁Bon if ant ▁( born ▁August ▁19, ▁... |
| 11994 | 24906027 | Dave Helmick | American auto racing biography stubs | vehicle.cars | ▁Dr . ▁Dave ▁Helm ick ▁is ▁a ▁former ▁American... |
| 12431 | 3843119 | GIC–Mixon Motorsports | American auto racing teams | vehicle.cars | ▁G IC – M ix on ▁Motor sports ▁was ▁a ▁NASCAR ... |
| 7557 | 11587786 | Quill Corp. v. North Dakota | United States Supreme Court cases | law | ▁Qui ll ▁Corp . ▁v . ▁North ▁Dakota , ▁5 04 ▁U... |
| ... | ... | ... | ... | ... | ... |
| 7443 | 53588239 | Water Splash, Inc. v. Menon | United States Supreme Court cases | law | ▁Water ▁Sp lash , ▁Inc . ▁v . ▁Men on , ▁5 81 ... |
| 4182 | 18566725 | Catherine Christer Hennix | 21st-century American women musicians | music | ▁Catherine ▁Christ er ▁Hen n ix ▁( al so ▁know... |
| 4820 | 59329679 | David Bismuth | 21st-century French male musicians | music | ▁David ▁Bis mu th ▁( born ▁10 ▁January ▁19 75 ... |
| 10196 | 62578576 | Xiao Wenjiao | Members of the Chinese Academy of Sciences | science | ▁X ia o ▁We n ji ao ▁( Ch ines e : ▁ 肖 文 交 ; ▁... |


Write the datasets to disk, ready to train models:

In [67]:
with open('./wikipedia/wikipedia-test.sp', 'w') as f:
    for index, row in df_dataset.iloc[25000:].iterrows():
        f.write("<doc id="+str(row['ID'])+" class="+row['label']+">\n")
        f.write(row['doc']+'\n')
        f.write("</doc>\n")
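
The cell above writes only the test part. The train and val parts can presumably be written the same way from other slices of the shuffled table. The sketch below shows one way to do this on toy data; `write_split`, `split_rows`, and the boundaries are hypothetical, and the actual boundaries used for the train and val files are not shown in this notebook.

```python
import io


def write_split(rows, out):
    """Write (ID, label, tokenized doc) triples in the <doc> layout used above."""
    for doc_id, label, doc in rows:
        out.write("<doc id=" + str(doc_id) + " class=" + label + ">\n")
        out.write(doc + '\n')
        out.write("</doc>\n")


def split_rows(rows, train_end, val_end):
    """Slice already shuffled rows into train, val, and test by position."""
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


# Toy rows standing in for the shuffled table; the boundaries are assumptions.
rows = [(i, "music", "▁tok en") for i in range(10)]
train, val, test = split_rows(rows, train_end=6, val_end=8)

buffer = io.StringIO()          # stands in for an open output file
write_split(test, buffer)
print(len(train), len(val), len(test))
print(buffer.getvalue())
```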

# 2. The 20newsgroups dataset

First, download and extract the [20newsgroups dataset](http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz) to the folder *datasets*. This yields two folders, one for training and one for testing.

Then, open a command window in this directory and use the script **wordpiece.py** to tokenize the dataset:

```bash
python wordpiece.py
```

Then we obtain two .sp files, one for training and one for testing. We further split the train part to form the validation part as below:

In [1]:
import random

In [2]:
# collect each <doc> ... </doc> block of the tokenized train file as one string
train_text = []
chunk = ''
with open('./20news-bydate/20news-bydate-train.sp') as f:
    for line in f:
        chunk += line
        # a closing tag ends the current document
        if line[:5] == "</doc":
            train_text.append(chunk)
            chunk = ''

In [3]:
len(train_text)

11314

In [4]:
random_seed = 111
random.seed(random_seed)
random.shuffle(train_text)

In [5]:
# validation set
with open('./20news-bydate/20news-bydate-val.sp', 'w') as f:
    for doc in train_text[8000:]:
        f.write(doc)

# new training set (overwrites the original train file)
with open('./20news-bydate/20news-bydate-train.sp', 'w') as f:
    for doc in train_text[:8000]:
        f.write(doc)
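
The shuffle-and-slice step can be checked on placeholder data before touching the real files. The sketch below is an illustration under assumptions: `shuffle_and_split` is a hypothetical helper, the documents are dummy strings, and it uses a local `random.Random` generator instead of seeding the global one as the notebook does.

```python
import random


def shuffle_and_split(items, n_train, seed):
    """Shuffle a copy of `items` reproducibly and split it at position n_train."""
    rng = random.Random(seed)      # local generator, leaves the global state alone
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


docs = ["doc%d" % i for i in range(11314)]   # placeholder documents
train, val = shuffle_and_split(docs, n_train=8000, seed=111)
print(len(train), len(val))
```

With 11314 training documents and a cut at 8000, the validation part receives the remaining 3314, which matches the counts given in the overview.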

In [10]:
import os
import shutil

# remove unused files and folders
shutil.rmtree('./20news-bydate/20news-bydate-train')
shutil.rmtree('./20news-bydate/20news-bydate-test')
os.remove('./20news-bydate.tar.gz')

# 3. Split and preprocess the WoS11967 dataset

First, download and extract the [WoS](https://data.mendeley.com/public-files/datasets/9rw3vkcfy4/files/c9ea673d-5542-44c0-ab7b-f1311f7d61df/file_downloaded) dataset and place it in this directory. We only use the medium-size dataset, which has 11967 documents.

Tokenize and split the WoS11967 dataset into three parts (train, val, and test) as below:

In [2]:
import re
import random
import numpy as np
import sentencepiece as spm
import pathlib
import shutil
import os

pathlib.Path('./wos').mkdir(parents=True, exist_ok=True)
random.seed(99)
sp = spm.SentencePieceProcessor()
sp.load('../spmcc.model')

True

In [3]:
def output_wordpieces_wos(text_file, label_file, train_p, val_p):
    """
    Tokenize each document and randomly assign it to the train, val, or test file.

    train_p, val_p: the proportions of the training part and the validation part.
    They should be > 0 and < 1, e.g. 0.6 and 0.2 (the testing part gets the remaining 0.2).
    """
    with open(label_file) as f:
        labels = f.readlines()

    n_docs = {'train': 0, 'val': 0, 'test': 0}  # count the number of docs for each part

    with open("./wos/wos11967-train.sp", 'w') as outfile_train, \
         open("./wos/wos11967-val.sp", 'w') as outfile_val, \
         open("./wos/wos11967-test.sp", 'w') as outfile_test, \
         open(text_file, encoding="utf8", errors='ignore') as f:
        outfiles = {'train': outfile_train, 'val': outfile_val, 'test': outfile_test}

        for label_idx, l in enumerate(f):
            l = l.rstrip('\n')
            ll = sp.encode_as_pieces(l)
            label = labels[label_idx].rstrip('\n')

            # use random to decide whether the current doc belongs to train, val, or test
            part = 'train'
            if not random.choices([0, 1], weights=[1-train_p, train_p])[0]:
                # among the non-training docs, a share of val_p/(1-train_p) goes to validation
                is_val = random.choices([0, 1], weights=[1-val_p/(1-train_p), val_p/(1-train_p)])[0]
                part = 'val' if is_val else 'test'

            outfiles[part].write("<doc id="+str(label_idx)+" class="+label+">\n")
            outfiles[part].write(' '.join(ll)+'\n')
            outfiles[part].write("</doc>\n")
            n_docs[part] += 1

    # write the number of docs in each part
    with open('./wos/wos11967_stat.txt', 'w') as f:
        f.write(str(n_docs['train']) + ' ' + str(n_docs['val']) + ' ' + str(n_docs['test']))

In [4]:
output_wordpieces_wos(text_file='./WebOfScience/WOS11967/X.txt',
                      label_file='./WebOfScience/WOS11967/Y.txt',
                      train_p=0.6, val_p=0.2)
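
The split is random per document, so the part sizes are only approximately proportional to `train_p` and `val_p`. The second draw uses `val_p/(1-train_p)` because it is made only for documents that were not assigned to train, so the overall chance of landing in val is `(1-train_p) * val_p/(1-train_p) = val_p`. The simulation below is a sketch of this reasoning; `assign_part` is a hypothetical helper that uses `rng.random() < p` in place of `random.choices`, so it does not reproduce the exact assignment made above.

```python
import random
from collections import Counter


def assign_part(rng, train_p, val_p):
    """Two-stage draw: train first, then val or test among the rest."""
    if rng.random() < train_p:
        return 'train'
    # conditional on not being in train, the chance of val is val_p / (1 - train_p)
    if rng.random() < val_p / (1 - train_p):
        return 'val'
    return 'test'


rng = random.Random(99)
n = 100_000
counts = Counter(assign_part(rng, 0.6, 0.2) for _ in range(n))
fractions = {part: counts[part] / n for part in ('train', 'val', 'test')}
print(fractions)   # close to 0.6, 0.2, 0.2, but not exact
```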

In [9]:
# remove unused files and folders
shutil.rmtree('./WebOfScience')
os.remove('./WebOfScience.zip')