The data for this project consist of two tab-separated tables. Each row corresponds to the abstract of a scientific article from the ACM Digital Library that was assigned to one or more topics from the ACM Computing Classification System (the old 1998 version).

There are two data sets for this task: a training sample and a test sample. Both are text files in TSV (tab-separated values) format, compressed with 7zip.

The training data (DM2023_training_docs_and_labels.tsv) has three columns: the first one is an identifier of a document, the second one stores the text of the abstract, and the third one contains a list of comma-separated topic labels.

The test data (DM2023_test_docs.tsv) has a similar format, but the labels in the third column are missing.
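
As a minimal sketch of the three-column layout described above, the snippet below parses two made-up rows with Python's `csv` module. The identifiers, abstracts, and labels are invented for illustration and are not taken from the data set; `parse_training_rows` is a hypothetical helper name.

```python
import csv
import io

# Two invented rows in the training layout: id <TAB> abstract <TAB> labels.
sample_tsv = (
    "0001.txt\tA study of query processing.\tH.3.3,H.2.4\n"
    "0002.txt\tNotes on compiler design.\tD.3.4\n"
)

def parse_training_rows(handle):
    """Yield (doc_id, text, labels) tuples from a tab-separated stream."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for doc_id, text, labels in reader:
        # The third column is a comma-separated list of topic labels.
        yield doc_id, text, [label.strip() for label in labels.split(",")]

rows = list(parse_training_rows(io.StringIO(sample_tsv)))
print(rows[0])
```

For the test file, the same loop would apply with the label column left out or empty.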

**The task and the format of submissions:** your task is to predict the labels of the documents in the test data and submit them to Moodle using the link below. A correctly formatted submission consists of a text file with exactly 100000 lines, plus the report. Each line should correspond to a document from the test data set (the order matters!) and contain a list of one or more predicted labels, separated by commas. The report can be an R/Python notebook (with code and computation outcomes). Please remember to include explanations and visualizations, and make the report as interesting for the reader as you can.
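
The format rules above can be checked before uploading. The following is only a sketch under the assumption that a submission has already been read into a list of strings; `check_submission_lines` is a hypothetical helper, not part of the official evaluation script.

```python
def check_submission_lines(lines, expected_count):
    """Return a list of problems found in a submission; empty means OK."""
    problems = []
    if len(lines) != expected_count:
        problems.append(f"expected {expected_count} lines, got {len(lines)}")
    for number, line in enumerate(lines, start=1):
        labels = [label.strip() for label in line.split(",")]
        # Every line needs at least one label, and no label may be blank.
        if not all(labels):
            problems.append(f"line {number}: empty label")
    return problems

good = ["H.3.3,H.2.4", "D.3.4"]
bad = ["H.3.3,", ""]
print(check_submission_lines(good, expected_count=2))
print(check_submission_lines(bad, expected_count=3))
```

For the real file, `expected_count` would be 100000.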

You may make several submissions (up to 20), so please remember to clearly mark the final version of your answer in case there is more than one.

Practical note: the submission size in Moodle is limited to 512 MB. If your files are larger, please use a compression format other than Zip (7z, gz, ...). Moodle does not like .zip files.
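
One way to produce a `.gz` file is Python's standard `gzip` module. The sketch below writes to a temporary directory; the file name `submission.txt.gz` and both helper functions are illustrative assumptions.

```python
import gzip
import os
import tempfile

def write_gzipped_lines(lines, path):
    """Write one text line per entry to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")

def read_gzipped_lines(path):
    """Read the lines back, without trailing newlines."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]

# "submission.txt.gz" is a hypothetical file name.
demo_path = os.path.join(tempfile.mkdtemp(), "submission.txt.gz")
write_gzipped_lines(["H.3.3,H.2.4", "D.3.4"], demo_path)
restored = read_gzipped_lines(demo_path)
print(restored)
```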

Evaluation: the quality of submissions will be evaluated using a script that computes the average F1-score, i.e., for each test document, the F1-score between the predicted and true labels will be computed, and the values obtained for all test documents will be averaged.
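
The measure described above can be sketched in a few lines of plain Python. This is not the official evaluation script; it assumes that a document with no correctly predicted label scores 0, and the label lists are invented. The result should correspond to scikit-learn's `f1_score` with `average="samples"`.

```python
def document_f1(true_labels, predicted_labels):
    """F1-score between two label sets for a single document."""
    true_set, predicted_set = set(true_labels), set(predicted_labels)
    overlap = len(true_set & predicted_set)
    if overlap == 0:
        # Assumption: no overlap (including an empty prediction) scores 0.
        return 0.0
    precision = overlap / len(predicted_set)
    recall = overlap / len(true_set)
    return 2 * precision * recall / (precision + recall)

def mean_document_f1(all_true, all_predicted):
    """Average the per-document F1 over all documents."""
    scores = [document_f1(t, p) for t, p in zip(all_true, all_predicted)]
    return sum(scores) / len(scores)

y_true = [["H.3.3", "H.2.4"], ["D.3.4"], ["I.2.7"]]
y_pred = [["H.3.3"], ["D.3.4"], []]
print(mean_document_f1(y_true, y_pred))
```

Here the three documents score 2/3, 1, and 0, so the average is 5/9. Note that an empty prediction always contributes 0.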

The deadline for sending the reports is Sunday, June 15.

Good luck!

# Introduction

Example of how documents are categorized in the **ACM Digital Library**:

(Main Class).(Subclass).(Subsubcategory)

**H.3.5**

* H. Information Systems
    * H.3 Information Storage and Retrieval
        * H.3.5 Online Information Services

**D.3.2**

* D. Software
    * D.3 Programming Languages
        * D.3.2 Language Classifications
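
The dotted label structure shown above can be unpacked programmatically. The helper below is a small illustrative sketch (`label_ancestors` is a hypothetical name) that lists a label together with its ancestors in the hierarchy.

```python
def label_ancestors(label):
    """Return the label and its ancestors, from main class down to the label."""
    parts = label.split(".")
    return [".".join(parts[: depth + 1]) for depth in range(len(parts))]

print(label_ancestors("H.3.5"))  # ['H', 'H.3', 'H.3.5']
print(label_ancestors("A.m"))    # ['A', 'A.m']
```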
  

# Loading data

In [37]:
# import sklearn as sk
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import os.path
from sklearn.model_selection import train_test_split

# `__file__` is not defined inside a notebook, so use the current working directory
notebook_dir = os.getcwd()
test_path = os.path.join(notebook_dir, "data", "DM2023_test_docs.tsv")
train_path = os.path.join(notebook_dir, "data", "DM2023_training_docs_and_labels.tsv")

nltk.download("stopwords")
stop_words = stopwords.words("english")



test = pd.read_csv(test_path, 
                    sep="\t", 
                    encoding="latin1", 
                    header=None,
                    names=["Textfile", "Text", "Topics"])
# test = test.drop_duplicates()
                    
                    
train_full = pd.read_csv(train_path, 
                    sep="\t", 
                    encoding="latin1", 
                    header=None,
                    names=["Textfile", "Text", "Topics"])


def flatten_if_single(x):
    """If x is a list of length 1, return its first element."""
    if isinstance(x, list) and len(x) == 1:
        return x[0]
    return x

# Separating topics
train_full["Topics"] = (
    train_full["Topics"]
    .apply(flatten_if_single)        
    .str.split(r"\s*,\s*")         
)

# train["Topics"] = train["Topics"].str.split(",")

unique_labels = set(label for sublist in train_full["Topics"] for label in sublist)

print(f"Number of unique topics: {len(unique_labels)}")
print("First 10 example topics: ",sorted(list(unique_labels))[:10])

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/konstanty/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Number of unique topics: 358
First 10 example topics:  ['A.0', 'A.1', 'A.2', 'A.m', 'B.0', 'B.1', 'B.1.0', 'B.1.1', 'B.1.2', 'B.1.3']


# Train (or load) LDA and MLB, and compute topic distributions
(The number of columns in the MLB binary matrix should match the number of unique topics.)
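
A toy sketch of why the column count can fall short of the number of unique topics: if the binarizer is fitted only on a subset of the documents, labels that occur only outside that subset get no column. The label lists below are invented, and the variable names are illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Invented toy labels: "C.2" occurs only in the held-out part.
all_topics = [["A.1", "B.0"], ["B.0"], ["A.1", "C.2"]]
train_topics, val_topics = all_topics[:2], all_topics[2:]

# Fitting on the training part alone misses "C.2".
mlb_train_only = MultiLabelBinarizer().fit(train_topics)

# Fitting on every label list keeps one column per known topic.
mlb_all = MultiLabelBinarizer().fit(all_topics)
y_train_demo = mlb_all.transform(train_topics)
y_val_demo = mlb_all.transform(val_topics)

print(list(mlb_train_only.classes_))  # ['A.1', 'B.0']
print(list(mlb_all.classes_))         # ['A.1', 'B.0', 'C.2']
print(y_train_demo.shape, y_val_demo.shape)  # (2, 3) (1, 3)
```

A classifier and a binarizer that disagree on the number of columns cannot be combined, so cached models should come from the same fit.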

In [None]:
# CountVectorizer builds a sparse matrix: rows = documents, columns = words, values = word counts
import pickle
import os.path
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
os.makedirs("models", exist_ok=True)



val   = train_full.iloc[80_000:].reset_index(drop=True)
train = train_full.iloc[:80_000].reset_index(drop=True)

print("Train shape", train.shape," and validation shape", val.shape)

if os.path.exists("models/lda_model.pkl") and os.path.exists("models/vectorizer.pkl") and os.path.exists("models/mlb_model.pkl"):

    print("Found LDA and vectorizer models!")
    
    with open("models/lda_model.pkl", "rb") as f:
        lda = pickle.load(f)

    with open("models/vectorizer.pkl", "rb") as f:
        vectorizer = pickle.load(f)

    with open("models/mlb_model.pkl", "rb") as f:
        mlb = pickle.load(f)

    X_train = vectorizer.transform(train["Text"])
    X_train = lda.transform(X_train)
    y_train = mlb.transform(train["Topics"])

    X_val = vectorizer.transform(val["Text"])
    X_val = lda.transform(X_val)
    y_val = mlb.transform(val["Topics"])

    print(f"Number of unique topics (ex. 'A.5' or 'H.3.5'): {len(mlb.classes_)}")
    print(f"Shape y_train: {y_train.shape}")

else:
    print("We need to train LDA, MLB and Vectorizer...")
    vectorizer = CountVectorizer(stop_words=stop_words, max_df=0.95, min_df=2)
    lda = LatentDirichletAllocation(n_components=200, random_state=42)
    mlb = MultiLabelBinarizer()

    X_train = vectorizer.fit_transform(train["Text"])
    X_train = lda.fit_transform(X_train)
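    # Fit the binarizer on all labels so that topics occurring only in the
    # validation part still get a column (one column per unique topic).
    mlb.fit(train_full["Topics"])
    y_train = mlb.transform(train["Topics"])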

    X_val = vectorizer.transform(val["Text"])
    X_val = lda.transform(X_val)
    y_val = mlb.transform(val["Topics"])

    print(f"Number of unique topics (ex. 'A.5' or 'H.3.5'): {len(mlb.classes_)}")
    print(f"Shape y_train: {y_train.shape}")
    
    # Save models
    with open("models/lda_model.pkl", "wb") as f:
        pickle.dump(lda, f)

    with open("models/vectorizer.pkl", "wb") as f:
        pickle.dump(vectorizer, f)

    with open("models/mlb_model.pkl", "wb") as f:
        pickle.dump(mlb, f)


X_test = vectorizer.transform(test["Text"])
X_test = lda.transform(X_test)
print(mlb.classes_) 

Train shape (80000, 3)  and validation shape (20000, 3)
We need to train LDA, MLB and Vectorizer...


# Check the topic distributions of 5 documents

In [44]:
file_ids = train["Textfile"].iloc[:5].values
topic_distributions = X_train[:5]

topics_df = pd.DataFrame(np.round(topic_distributions, 3),
                                   columns=[f"Topic {i}" for i in range(lda.n_components)],
                                   index=file_ids)
topics_df.T

| | 580106.txt | 1755942.txt | 1416298.txt | 1516665.txt | 1259693.txt |
|---|---|---|---|---|---|
| Topic 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Topic 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Topic 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Topic 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Topic 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| Topic 195 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Topic 196 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Topic 197 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Topic 198 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


# Let's see words assigned to different topics with LDA

In [45]:
feature_names = vectorizer.get_feature_names_out()

def show_top_words(model, feature_names, n_top_words=10):
    for topic_idx, topic in enumerate(model.components_):
        top_features = [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
        print(f"Topic {topic_idx}: {' '.join(top_features)}")

show_top_words(lda, feature_names)

Topic 0: sorting ensemble port hypercube interconnection parameterization boosting suffix ensembles transparency
Topic 1: distribution estimation statistical estimates estimate statistics sample distributions likelihood variance
Topic 2: mapping environment reality augmented environments urban real ar vr presence
Topic 3: mechanism workflow mechanisms auction workflows agreement auctions party incentive elicitation
Topic 4: air fabric injection pd pollution patch dilemma complexes simplicial catalog
Topic 5: series hybrid term model long short forecasting forecast time weather
Topic 6: management information identification tags rfid tag ir legal law compliance
Topic 7: project risk projects software practices development agile management developers best
Topic 8: process decision processes making model criteria hierarchical decisions approach selection
Topic 9: grid multimedia media mobile applications devices paper based system computing
Topic 10: codes code cyclic copy walking correct

# Training classifier

In [46]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

if os.path.exists("models/classifier.pkl"):
    print("Found classifier model!")

    with open("models/classifier.pkl", "rb") as f:
        clf = pickle.load(f)

else:
    print("We need to train classifier first...")
    lr = LogisticRegression(
        max_iter=1000,
        solver="saga",
        n_jobs=-1
    )

    # Fit one binary logistic regression per label (one-vs-rest)
    clf = OneVsRestClassifier(lr, n_jobs=-1)
    clf.fit(X_train, y_train)

    with open("models/classifier.pkl", "wb") as f:
        pickle.dump(clf, f)

We need to train classifier first...


ValueError: Found input variables with inconsistent numbers of samples: [80000, 100000]

# Validation

In [42]:
from sklearn.metrics import f1_score, classification_report, hamming_loss

print("Validation...")
y_pred_bin = clf.predict(X_val)
y_pred_labels = mlb.inverse_transform(y_pred_bin)

predicted_topics_list = [list(labels) for labels in y_pred_labels]
print(predicted_topics_list[:100])

val["PredictedTopics"] = predicted_topics_list

y_val_true_bin = mlb.transform(val["Topics"])
y_val_pred_bin = mlb.transform(val["PredictedTopics"])


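# "samples" averages the per-document F1, the measure named in the task description
print("samples-F1:", f1_score(y_val_true_bin, y_val_pred_bin, average="samples", zero_division=0))
print("micro-F1 :", f1_score(y_val_true_bin, y_val_pred_bin, average="micro"))
print("macro-F1 :", f1_score(y_val_true_bin, y_val_pred_bin, average="macro"))
print("Hamming  :", hamming_loss(y_val_true_bin, y_val_pred_bin))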

Validation...


ValueError: Expected indicator for 347 classes, but got 358

# Prediction

In [None]:

print("Prediction...")
y_pred_bin = clf.predict(X_test)
y_pred_labels = mlb.inverse_transform(y_pred_bin)

# predicted_topics_str = [",".join(labels) if labels else "-" for labels in y_pred_labels]
predicted_topics_list = [list(labels) for labels in y_pred_labels]
print(predicted_topics_list[:100])

results = pd.DataFrame({
    "Textfile": test["Textfile"].values,
    "Predicted Topics": predicted_topics_list
})


# Make sure the results follow the order of the test file
order = test["Textfile"]


results_sorted = (
    results.set_index("Textfile")   # <- key used for matching
           .loc[order]              # <- reindex in the reference order
           .reset_index()           # <- back to a regular column
)

# 3. (optional) overwrite `results`
results = results_sorted

print(results.head(15))

Prediction...
[[], [], [], [], [], [], [], [], ['I.2.7'], [], [], [], [], [], [], [], [], ['I.2.4'], [], [], [], [], ['F.2.2', 'G.1.6', 'I.2.8'], ['B.7.1'], ['J.3'], [], [], ['I.3.5', 'I.3.7'], [], [], ['B.7.1', 'B.7.2', 'B.8.2'], [], [], [], [], [], ['K.3.2'], [], [], [], [], [], [], [], [], [], [], [], ['K.3.2'], [], [], [], ['H.3.3'], ['H.5.2'], [], [], [], [], [], [], [], ['C.2.4', 'I.2.11'], [], [], [], ['F.2.2', 'G.2.2'], ['G.1.6'], [], [], ['G.3'], ['I.5.2'], [], [], [], [], [], [], ['C.2.1', 'C.2.2', 'C.2.3', 'C.4'], [], ['F.4.1'], [], [], [], [], [], [], [], ['H.3.3'], ['K.3.2'], [], ['K.3.1', 'K.3.2'], [], [], [], ['G.2.2'], ['C.2.1', 'C.2.2', 'C.4'], ['I.5.2'], [], ['D.3.2'], []]
       Textfile Predicted Topics
0    963168.txt               []
1   1811004.txt               []
2    192631.txt               []
3   1183872.txt               []
4   1280491.txt               []
5   1059284.txt               []
6   1133457.txt               []
7   1140350.txt               []
8  
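
Many of the predictions above are empty, while the task asks for one or more labels per line. One possible remedy, sketched here under the assumption that per-label probabilities are available (in this notebook they could presumably come from `clf.predict_proba(X_test)`), is to give each empty row its single most probable label. The numbers below are invented and `fill_empty_predictions` is a hypothetical helper.

```python
import numpy as np

def fill_empty_predictions(pred_bin, proba):
    """Give every row without a predicted label its highest-probability label."""
    filled = pred_bin.copy()
    empty_rows = filled.sum(axis=1) == 0
    best_label = proba.argmax(axis=1)
    # Set the best label only for rows that currently have no label.
    filled[empty_rows, best_label[empty_rows]] = 1
    return filled

# Invented numbers: 3 documents, 4 labels.
proba_demo = np.array([
    [0.10, 0.30, 0.20, 0.05],
    [0.70, 0.10, 0.60, 0.20],
    [0.05, 0.02, 0.01, 0.40],
])
pred_demo = (proba_demo >= 0.5).astype(int)
filled_demo = fill_empty_predictions(pred_demo, proba_demo)
print(filled_demo)
```

The filled matrix can then be passed to `mlb.inverse_transform` in place of the raw binary predictions. Whether this improves the score would need to be checked on the validation data.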

# Test

In [None]:
# X_test = vectorizer.transform(test["Text"])
# lda_test_topic = lda.transform(X_test)


# predict_binary = classifier.predict(lda_test_topic)
# predict_labels = mlb.inverse_transform(predict_binary)

# Report
- Example topics (keywords from LDA) DONE
- PCA
- precision/recall on validation data
