<a href="https://www.kaggle.com/code/angevalli/classify-people-by-profession-from-wikipedia?scriptVersionId=133902187" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

The goal of this lab is to classify Wikipedia abstracts about people by their professions. For example, the professions of Elvis Presley are "singer" and "actor".

=== Input ===

The input for training is a file wiki-train.json, which contains Wikipedia abstracts in the following form:
   {"title": "George_Washington",
    "summary": "George Washington was one of the ...",
    "occupations": ["yago:politician"]}

The input for testing is a file wiki-test.json, which contains Wikipedia abstracts of the same shape without the occupations:

   {"title": "Douglas_Adams",
    "summary": "Douglas Noel Adams was ..."}

=== Output ===

The output shall be a JSON file that assigns each Wikipedia abstract to a set of occupations:
   {"title": "Douglas_Adams",
    "occupations": ["Q36180", "Q28389"]}

=== Datasets ===

We provide 3 datasets:
1) a training dataset, which has the labels
2) a development dataset, which has the labels
3) a testing dataset, which does not have the labels
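
A minimal sketch of how such a dataset could be read, assuming each sample is stored as one JSON object per line in a gzip-compressed file. The one-object-per-line layout and the helper name `iter_samples` are assumptions made for illustration; the `load_data` function of the utility script may work differently.

```python
import gzip
import json


def iter_samples(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Training and development samples would then carry an "occupations" key, while testing samples would not.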

=== Suggestions for improving the model ===

1) Select a suitable theta value
Reference: held-out set, cross validation, grid search...

2) Use pre-trained embeddings
Reference: word2vec, GloVe, FastText...

3) Add extra features
Reference: stop words, part-of-speech...

4) Try other neural networks
Reference: CNN, RNN, Attention, Transformer...

5) Avoid overfitting
Reference: regularization, dropout...

6) Adjust other parameters
Reference: learning rate, batch size, number of epochs, layer dimensions...
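
Suggestion 1 can be illustrated with a small grid search over theta on a held-out set. This is only a sketch: the probabilities and labels below are invented toy values, `micro_f1` and `best_theta` are hypothetical helper names, and micro-averaging is an assumption (the `f1_score` function of the utility script may average differently).

```python
import numpy as np


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (sample, label) pairs of two 0/1 matrices."""
    true_positives = np.logical_and(y_true == 1, y_pred == 1).sum()
    predicted = (y_pred == 1).sum()
    actual = (y_true == 1).sum()
    if predicted == 0 or actual == 0 or true_positives == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


def best_theta(probabilities, y_true, candidates):
    """Return the (theta, f1) pair with the highest held-out F1."""
    scored = [(theta, micro_f1(y_true, (probabilities > theta).astype(int)))
              for theta in candidates]
    return max(scored, key=lambda pair: pair[1])


# Toy held-out set: 3 samples, 4 labels (values invented for illustration).
probabilities = np.array([[0.90, 0.40, 0.10, 0.20],
                          [0.20, 0.80, 0.35, 0.10],
                          [0.28, 0.10, 0.70, 0.60]])
y_true = np.array([[1, 1, 0, 0],
                   [0, 1, 1, 0],
                   [0, 0, 1, 1]])

theta, score = best_theta(probabilities, y_true, np.arange(0.1, 0.95, 0.05))
print(f"best theta = {theta:.2f}, held-out F1 = {score:.2f}")
```

In practice the probabilities would come from `model.predict` on the development dataset, and the selected theta should be chosen there rather than on the testing dataset.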

In [1]:
import os
for dirname, _, filenames in os.walk("/kaggle/input"):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/wikipedia-abstracts/wiki-dev.json.gzde
/kaggle/input/wikipedia-abstracts/wiki-dev.json
/kaggle/input/wikipedia-abstracts/wiki-train.json.gzde
/kaggle/input/wikipedia-abstracts/wiki-test.json.gzde
/kaggle/input/wikipedia-abstracts/wiki-test.json/new_wiki-test.json
/kaggle/input/wikipedia-abstracts/wiki-train.json/new_wiki-train.json
/kaggle/input/glove6b200d/glove.6B.200d.txt


### Preprocessing

In [2]:
!cp /kaggle/input/wikipedia-abstracts/wiki-test.json.gzde /kaggle/working/wiki-test.json.gz
!cp /kaggle/input/wikipedia-abstracts/wiki-train.json.gzde /kaggle/working/wiki-train.json.gz
!cp /kaggle/input/wikipedia-abstracts/wiki-dev.json.gzde /kaggle/working/wiki-dev.json.gz

In [3]:
# Import the necessary modules and methods
import nltk
import numpy as np
from tqdm import tqdm

# Import functions from keras
from keras.layers import Embedding, Dense, Dropout, Flatten
from keras.models import Model, load_model
import keras.backend as K
from keras import Sequential

# Import some basic packages
import os
import sys
import json

# Download the punkt tokenizer data in case it is missing
nltk.download("punkt")

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
sys.path.insert(0,"/kaggle/working/")

# Import custom functions
from utility_script_people_classification import load_vocabulary, read_dataset, get_label, f1_score, load_data, gen_vocabulary

In [5]:
# input files
# [ train_file ] is a training dataset that contains 266K samples.
# [ test_file ] is a testing dataset that contains 200K samples. You can test your model based on this file.
# [ predict_file ] is a predicting dataset that contains 201K samples. Each sample in this file does not have occupation labels.

train_file = "/kaggle/working/wiki-train.json.gz"
test_file = "/kaggle/working/wiki-dev.json.gz"
predict_file = "/kaggle/working/wiki-test.json.gz"

# output files
# [ vocab_file ] has a word vocabulary that defines which words participate in this task.
# The default vocabulary is generated by our methods from the training dataset,
# but you can create it in any way you like.
# [ model_file ] is used to store your trained model.
# [ result_file ] is the file that stores your predicted occupations.

vocab_file = "/kaggle/working/vocab.txt"
model_file = "/kaggle/working/my_model.h5"
result_file = "/kaggle/working/result.json"
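
The comment above notes that the vocabulary can be created in any way you like. One possible approach is sketched below, assuming a simple lowercase regular-expression tokenizer and a minimum frequency cut-off. The function name `build_vocabulary`, the reserved `<pad>` and `<unk>` ids and the `min_count` parameter are all hypothetical and are not taken from `gen_vocabulary`.

```python
import re
from collections import Counter


def build_vocabulary(summaries, min_count=2):
    """Map each sufficiently frequent lowercase token to an integer id.

    Ids 0 and 1 are reserved for padding and unknown words (an assumed
    convention, not necessarily the one used by gen_vocabulary).
    """
    counts = Counter()
    for summary in summaries:
        counts.update(re.findall(r"[a-z0-9]+", summary.lower()))
    vocab = {"<pad>": 0, "<unk>": 1}
    for word, count in counts.most_common():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab


vocab = build_vocabulary([
    "George Washington was a politician.",
    "Douglas Adams was a writer.",
])
print(vocab)
```

Raising `min_count` shrinks the vocabulary, which reduces the size of the embedding matrix at the cost of mapping more words to the unknown id.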

We use pre-trained GloVe embeddings (the 200-dimensional glove.6B.200d.txt file).

In [6]:
# Hyper-parameters: You don't have to change these, but you can.
# [ embedding_dimension ] the dimension of the word embeddings
# [ maximal_sentence_length ] the maximum length of each sentence
# [ number_of_labels ] the number of occupations
# [ epochs ] the number of training epochs. Adjust this parameter to avoid overfitting and underfitting.
# [ batch_size ] the number of samples fed into your model at each training step.
# The size of this parameter also depends on how good your hardware is.
# [ theta ] a threshold that determines whether to assign a specific occupation label to an input sample.
# A suitable theta value will help your model.

embedding_dimension = 200
maximal_sentence_length = 100
number_of_labels = 20
epochs = 20
batch_size = 32
theta = 0.45 # Change theta value
validation_split = 0.1
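
To make the role of maximal_sentence_length concrete, here is a minimal sketch of turning a token list into a fixed-length sequence of word ids. The helper name `encode_and_pad`, the padding id 0, the unknown-word id 1 and padding at the end of the sequence are assumptions; `read_dataset` in the utility script may use other conventions.

```python
def encode_and_pad(tokens, vocab_to_id, max_length, pad_id=0, unk_id=1):
    """Map tokens to ids, truncate to max_length, then right-pad with pad_id."""
    ids = [vocab_to_id.get(token, unk_id) for token in tokens[:max_length]]
    return ids + [pad_id] * (max_length - len(ids))


toy_vocab = {"<pad>": 0, "<unk>": 1, "george": 2, "was": 3}
print(encode_and_pad(["george", "was", "president"], toy_vocab, max_length=5))
# → [2, 3, 1, 0, 0]
```

Every sample then has exactly `max_length` ids, which is what the `input_length` argument of the embedding layer expects.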

In [7]:
# Build a word -> vector dictionary from the pre-trained GloVe file.
# The default embedding dimension is 200, so we use the 200-dimensional GloVe file.
embeddings_dictionary = dict()
with open("/kaggle/input/glove6b200d/glove.6B.200d.txt", encoding="utf8") as glove_file:
    for line in glove_file:
        records = line.split()
        word = records[0]
        vector_dimensions = np.asarray(records[1:], dtype="float32")
        embeddings_dictionary[word] = vector_dimensions
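
The dictionary built above still has to be turned into a matrix whose row i is the vector of the word with id i. In this notebook that step happens inside `read_dataset`, whose code is not shown, so the following is only a sketch under assumptions: the helper name `build_embedding_matrix` is hypothetical, and words missing from GloVe are given an all-zero row.

```python
import numpy as np


def build_embedding_matrix(vocab_to_id, embeddings, dimension):
    """Row i holds the pre-trained vector of the word with id i (zeros if absent)."""
    matrix = np.zeros((max(vocab_to_id.values()) + 1, dimension), dtype="float32")
    for word, index in vocab_to_id.items():
        vector = embeddings.get(word)
        if vector is not None:
            matrix[index] = vector
    return matrix


# Toy data with 3-dimensional vectors (values invented for illustration).
toy_vocab_to_id = {"<pad>": 0, "<unk>": 1, "king": 2, "queen": 3}
toy_embeddings = {"king": np.array([1.0, 2.0, 3.0], dtype="float32")}
toy_matrix = build_embedding_matrix(toy_vocab_to_id, toy_embeddings, dimension=3)
print(toy_matrix.shape)
# → (4, 3)
```

Only the words present in the vocabulary get a row, so the matrix stays much smaller than the full GloVe table.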

In [8]:
def create_model(embedding_matrix):
    '''
    Build the classifier on top of a frozen pre-trained embedding layer.
    :param embedding_matrix: array of shape (vocabulary size, embedding_dimension); row i is the vector of word id i
    :return: a compiled Keras model
    '''
    model = Sequential()
    model.add(Embedding(input_dim=len(embedding_matrix), 
                        output_dim=embedding_dimension, 
                        weights=[embedding_matrix],
                        input_length=maximal_sentence_length,
                        trainable=False))
    model.add(Flatten())
    model.add(Dense(32, activation="relu"))
    model.add(Dropout(0.5))
    model.add(Dense(number_of_labels, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["binary_accuracy"])
    return model
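
As a quick check of the model's size, the number of trainable parameters can be worked out by hand from the hyper-parameters above. The embedding layer is frozen (trainable=False), so only the two dense layers contribute. The variable names below are illustrative.

```python
maximal_sentence_length = 100
embedding_dimension = 200
number_of_labels = 20
hidden_units = 32

# Flatten turns the (100, 200) embedding output into one long vector.
flattened = maximal_sentence_length * embedding_dimension

# Each dense layer has one weight per input/output pair plus one bias per output.
hidden_params = flattened * hidden_units + hidden_units
output_params = hidden_units * number_of_labels + number_of_labels
trainable_params = hidden_params + output_params

print(flattened, hidden_params, output_params, trainable_params)
# → 20000 640032 660 640692
```

Almost all trainable weights sit in the first dense layer, because Flatten keeps one input per word position and embedding dimension.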

In [9]:
def train(debug):
    '''
    train your model.
    :param debug:whether to use a small fraction of samples
    :return:
    '''

    # prepare data
    vocab_to_id = load_vocabulary(vocab_file)
    data_x, data_y, embedding_matrix = read_dataset(train_file, vocab_to_id, maximal_sentence_length, train=True, embeddings_dictionary=embeddings_dictionary, embedding_dimension=embedding_dimension, debug=debug) # Adding embedding_matrix
    data_x_test, data_y_test, _ = read_dataset(test_file, vocab_to_id, maximal_sentence_length, train=False, debug=debug)

    # create a model
    model = create_model(embedding_matrix)
    model.summary()  

    # train
    # In Keras, validation_data takes precedence over validation_split, so only the development set is passed here.
    print("start to train, data size = {a}".format(a=len(data_x)))
    model.fit(data_x, data_y, epochs=epochs, batch_size=batch_size, validation_data=(data_x_test, data_y_test))

    # save model
    model.save(model_file)

def evaluate_on_dev(debug):
    '''
    evaluate your model on the development dataset.

    :param debug:whether to use a small fraction of samples
    :return:
    '''

    # prepare data
    vocab_to_id = load_vocabulary(vocab_file)
    data_x, data_y, _ = read_dataset(test_file, vocab_to_id, maximal_sentence_length, train=False, debug=debug)
    raw_samples = list(load_data(test_file))
    print("start to do validation, data size = {a}".format(a=len(data_x)))
    _, id_to_labels = get_label()
    pred_labels, true_labels = list(), list()

    # load model
    model = load_model(model_file)

    # predict each sample
    for summary, label, raw in zip(data_x, data_y, raw_samples):
        result = model.predict(np.array([summary]))[0]
        pred = set([id_to_labels[i] for i, prob in enumerate(result) if prob > theta])
        true = set([id_to_labels[index] for index, e in enumerate(label) if e == 1])
        pred_labels.append(pred)
        true_labels.append(true)

        # print wrong prediction
        print("Title:" + raw.title)
        wrong_occupations = pred - true
        if len(wrong_occupations) > 0:
            print("[ wrong prediction ] this person does not have the occupations:{a}".format(a=wrong_occupations))
        missing_occupations = true - pred
        if len(missing_occupations) > 0:
            print("[ missing prediction ] your prediction misses the occupations:{b}".format(b=missing_occupations))
        print("---------------------------")

    # calculate metrics
    f1, precision, recall = f1_score(true_labels, pred_labels)
    print("result on validation set, f1 : {a}, precision : {b}, recall : {c}.".
          format(a=f1, b=precision, c=recall))


def predict_on_test(debug):
    '''
    :param debug: whether to use a small fraction of samples
    :return:
    '''

    # prepare data
    _, id_to_labels = get_label()
    vocab_to_id = load_vocabulary(vocab_file)
    model = load_model(model_file)
    datax, _, _= read_dataset(predict_file, vocab_to_id, maximal_sentence_length, train=False, debug=debug)
    raw_samples = list(load_data(predict_file))

    # predict and write one JSON object per line; the with-block closes the file when done
    with open(result_file, "w", encoding="utf8") as r_f:
        for data, raw_sample in tqdm(zip(datax, raw_samples)):
            result = model.predict(np.array([data]))[0]
            pred = [id_to_labels[i] for i, prob in enumerate(result) if prob > theta]
            r_f.write(json.dumps({
                "title": raw_sample.title,
                "occupations": pred
            }) + "\n")
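
The functions above call `model.predict` once per sample, which is slow on large datasets. A possible alternative is a single batched call followed by thresholding, sketched below. `predict_label_sets` is a hypothetical helper, and `ConstantModel` is a stand-in that exists only to make the sketch runnable without a trained model.

```python
import numpy as np


def predict_label_sets(model, data_x, id_to_labels, theta, batch_size=256):
    """Run one batched predict call, then threshold each row at theta."""
    probabilities = model.predict(np.asarray(data_x), batch_size=batch_size)
    return [
        [id_to_labels[i] for i, prob in enumerate(row) if prob > theta]
        for row in probabilities
    ]


class ConstantModel:
    """Stand-in for a trained model, used only to make the sketch runnable."""

    def predict(self, inputs, batch_size=32):
        return np.tile(np.array([0.9, 0.2, 0.6]), (len(inputs), 1))


toy_labels = {0: "singer", 1: "politician", 2: "actor"}
toy_predictions = predict_label_sets(ConstantModel(), [[1, 2], [3, 4]], toy_labels, theta=0.45)
print(toy_predictions)
# → [['singer', 'actor'], ['singer', 'actor']]
```

With a real Keras model, the same function would be called with the loaded model and the encoded dataset.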

In [10]:
# create a vocabulary file if it does not exist
if not os.path.exists(vocab_file):
    gen_vocabulary(train_file, vocab_file)

# train & evaluate & predict
# note: setting the switch 'debug' to True means only a small fraction of samples is used, which saves time when debugging your code.
# Change 'debug' to False when you want to train and test on all samples.
debug = True
train(debug=debug)

266938it [02:18, 1931.88it/s]


done! The size of vocabulary is 344677.


100it [00:00, 975.91it/s]



100it [00:00, 1246.70it/s]



In [11]:
evaluate_on_dev(debug=debug)

100it [00:00, 1273.53it/s]



In [12]:
predict_on_test(debug=debug)

100it [00:00, 1568.77it/s]



100it [00:05, 17.33it/s]


We obtain an F1 score of 0.70, a precision of 0.88 and a recall of 0.58. The high precision suggests that the pre-trained word embeddings help this architecture.
Precision is the share of predicted labels that are correct, so a high precision means few false positives. Recall is the share of true labels that the model recovers, so a lower recall means more false negatives, i.e. occupations that should have been predicted but were not. Looking at the result.json file in the output, some predictions are indeed missing. Since F1 is the harmonic mean of precision and recall, 2 * 0.88 * 0.58 / (0.88 + 0.58) ≈ 0.70, the low recall pulls it down.

In short, the model makes few wrong predictions but misses a fair number of occupations. A lower theta, a deeper network with more layers or a higher number of epochs might improve performance.
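
The high-precision, low-recall pattern described above can be reproduced on a toy example. The sketch below computes micro-averaged scores over sets of labels. `micro_scores` is a hypothetical helper; the `f1_score` function of the utility script may average differently, and the label sets are invented for illustration.

```python
def micro_scores(true_sets, pred_sets):
    """Return (f1, precision, recall), micro-averaged over all samples."""
    true_positives = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    predicted = sum(len(p) for p in pred_sets)
    actual = sum(len(t) for t in true_sets)
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return f1, precision, recall


# Every predicted label is correct, but two true labels are missed.
toy_true = [{"singer", "actor"}, {"politician"}, {"writer", "actor"}]
toy_pred = [{"singer"}, {"politician"}, {"writer"}]
toy_f1, toy_precision, toy_recall = micro_scores(toy_true, toy_pred)
print(round(toy_f1, 2), round(toy_precision, 2), round(toy_recall, 2))
# → 0.75 1.0 0.6

# The reported scores are consistent with the harmonic-mean formula.
print(round(2 * 0.88 * 0.58 / (0.88 + 0.58), 2))
# → 0.7
```

A model that never predicts a wrong label but skips some true ones keeps a perfect precision, while recall and therefore F1 drop.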

