Data for this project consists of two tables in a tab-separated format. Each row in those files corresponds to an abstract of a scientific article from the ACM Digital Library, which was assigned to one or more topics from the ACM Computing Classification System (the old 1998 version).

There are two data sets for this task: the training sample and the test sample. Both are TSV (tab-separated values) text files compressed with 7zip.

The training data (DM2023_training_docs_and_labels.tsv) has three columns: the first one is an identifier of a document, the second one stores the text of the abstract, and the third one contains a list of comma-separated topic labels.

The test data (DM2023_test_docs.tsv) has a similar format, but the labels in the third column are missing.

**The task and the format of submissions:** your task is to predict the labels of the documents in the test data and submit them to Moodle using the link below. A correctly formatted submission is a text file with exactly 100000 lines, accompanied by the report. Each line should correspond to one document from the test data set (the order matters!) and contain a list of one or more predicted labels, separated by commas. The report can be an R/Python notebook (with code and computation outcomes). Please remember to include explanations and visualizations, and make the report as interesting for a reader as you can.
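
The submission format described above can be sketched as follows. This is a minimal illustration, not the official tooling: the helper names `format_submission_lines` and `write_submission` are hypothetical, and the default of 100000 expected rows is taken from the task description.

```python
from typing import Iterable, List, Sequence


def format_submission_lines(predictions: Iterable[Sequence[str]]) -> List[str]:
    """Turn per-document label collections into submission lines.

    Raises ValueError for a document without labels, because the task
    asks for one or more labels on every line.
    """
    lines = []
    for row_number, labels in enumerate(predictions, start=1):
        if not labels:
            raise ValueError(f"document on line {row_number} has no predicted label")
        lines.append(",".join(labels))
    return lines


def write_submission(predictions, path, expected_rows=100_000):
    """Write one line per test document, in test-file order (hypothetical helper)."""
    lines = format_submission_lines(predictions)
    if len(lines) != expected_rows:
        raise ValueError(f"expected {expected_rows} lines, got {len(lines)}")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")


# Tiny demonstration with three made-up documents.
demo = [("H.3.5",), ("D.3.2", "D.2.4"), ("B.1.0",)]
print(format_submission_lines(demo))
```

Checking the row count before writing catches the most likely formatting mistake: a prediction list that no longer lines up with the test file.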

You may make several submissions (up to 20), so please remember to clearly mark the final version of your answer in case there is more than one.

Practical note: the submission size in Moodle is limited to 512MB. If your files are larger, please use a compression format other than Zip (7z, gz, ...). Moodle does not like .zip files.

Evaluation: the quality of submissions will be evaluated using a script to compute the average F1-score measure, i.e., for each test document, the F1-score between the predicted and true labels will be computed, and the values obtained for all test cases will be averaged.
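
The metric described above can be approximated locally with a few lines of Python. This is only a sketch: the official evaluation script is not shown here, so the function names and the handling of documents with no overlapping labels (scored as 0) are assumptions.

```python
def document_f1(predicted, true):
    """F1-score between the predicted and true label sets of one document."""
    predicted, true = set(predicted), set(true)
    overlap = len(predicted & true)
    if overlap == 0:
        # Assumption: no shared label (or an empty set) scores 0.
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(true)
    return 2 * precision * recall / (precision + recall)


def mean_f1(all_predicted, all_true):
    """Average of the per-document F1-scores."""
    scores = [document_f1(p, t) for p, t in zip(all_predicted, all_true)]
    return sum(scores) / len(scores)


# Two made-up documents: one partially correct, one completely wrong.
predicted = [["H.3.5", "D.3.2"], ["B.1"]]
true = [["H.3.5"], ["B.2"]]
print(round(mean_f1(predicted, true), 4))
```

In the first document precision is 0.5 and recall is 1, so its F1 is 2/3; the second scores 0, which gives a mean of 1/3.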

The deadline for sending the reports is Sunday, June 15.

Good luck!

# Introduction

Example of how documents are categorized in the **ACM Digital Library**:

(Main Class).(Subclass).(Subsubcategory)

**H.3.5**

* H. Information Systems
    *  H.3 Information Storage and Retrieval
        * H.3.5 Online Information Services

**D.3.2**

* D. Software
    * D.3 Programming Languages
        * D.3.2 Language Classifications
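
The hierarchy shown above is encoded directly in the label string, so the parent categories of a label can be recovered by splitting on dots. A minimal sketch, where `label_ancestors` is a hypothetical helper that the rest of this notebook does not use:

```python
def label_ancestors(label):
    """Return the label together with all of its parent categories,
    ordered from the main class down to the label itself."""
    parts = label.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


print(label_ancestors("H.3.5"))  # ['H', 'H.3', 'H.3.5']
print(label_ancestors("D.3"))    # ['D', 'D.3']
```

Such a helper could be used, for instance, to group labels by main class when exploring the data.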
  

# Loading data

In [None]:
# import sklearn as sk
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords


nltk.download("stopwords")
stop_words = stopwords.words("english")

test = pd.read_csv("data/DM2023_test_docs.tsv", 
                    sep="\t", 
                    encoding="latin1", 
                    header=None,
                    names=["Textfile", "Text", "Topics"])
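# Keep every test row: the submission needs one line per test document,
# in the original order, so duplicates must not be dropped here.
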
train = pd.read_csv("data/DM2023_training_docs_and_labels.tsv", 
                    sep="\t", 
                    encoding="latin1", 
                    header=None,
                    names=["Textfile", "Text", "Topics"])


def flatten_if_single(x):
    """If x is a list of length 1, return its first element."""
    if isinstance(x, list) and len(x) == 1:
        return x[0]
    return x

# Separating topics
train["Topics"] = (
    train["Topics"]
    .apply(flatten_if_single)        
    .str.split(r"\s*,\s*")         
)



# train["Topics"] = train["Topics"].str.split(",")

unique_labels = set(label for sublist in train["Topics"] for label in sublist)

print(f"Number of unique topics: {len(unique_labels)}")
print("First 10 example topics: ",sorted(list(unique_labels))[:10])

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/konstanty/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Number of unique topics: 358
First 10 example topics:  ['A.0', 'A.1', 'A.2', 'A.m', 'B.0', 'B.1', 'B.1.0', 'B.1.1', 'B.1.2', 'B.1.3']


# Train (or load) LDA, MLB and topic distributions
(The number of columns in the MLB binary matrix should match the number of unique topics.)

In [None]:
# Bag-of-words counts as a sparse matrix (rows = documents, columns = words), used as input to LDA
import pickle
import os.path
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
os.makedirs("models", exist_ok=True)


if os.path.exists("models/lda_model.pkl") and os.path.exists("models/vectorizer.pkl") and os.path.exists("models/mlb_model.pkl"):

    print("Found saved LDA, vectorizer and MLB models!")
    
    with open("models/lda_model.pkl", "rb") as f:
        lda = pickle.load(f)

    with open("models/vectorizer.pkl", "rb") as f:
        vectorizer = pickle.load(f)

    with open("models/mlb_model.pkl", "rb") as f:
        mlb = pickle.load(f)

    X_train = vectorizer.transform(train["Text"])
    X_train = lda.transform(X_train)
    y_train = mlb.transform(train["Topics"])
    print(f"Number of unique topics (ex. 'A.5' or 'H.3.5'): {len(mlb.classes_)}")
    print(f"Shape y_train: {y_train.shape}")

    X_test = vectorizer.transform(test["Text"])  # use the test documents, not the training ones
    X_test = lda.transform(X_test)

else:
    print("We need to train LDA, MLB and Vectorizer...")
    vectorizer = CountVectorizer(stop_words=stop_words, max_df=0.95, min_df=2)
    lda = LatentDirichletAllocation(n_components=50, random_state=42)
    mlb = MultiLabelBinarizer()

    X_train = vectorizer.fit_transform(train["Text"])
    X_train = lda.fit_transform(X_train)
    y_train = mlb.fit_transform(train["Topics"])

    print(f"Number of unique topics (ex. 'A.5' or 'H.3.5'): {len(mlb.classes_)}")
    print(f"Shape y_train: {y_train.shape}")
    
    X_test = vectorizer.transform(test["Text"])  # use the test documents, not the training ones
    X_test = lda.transform(X_test)

    # Save models
    with open("models/lda_model.pkl", "wb") as f:
        pickle.dump(lda, f)

    with open("models/vectorizer.pkl", "wb") as f:
        pickle.dump(vectorizer, f)

    with open("models/mlb_model.pkl", "wb") as f:
        pickle.dump(mlb, f)


# print(mlb.classes_) 

We need to train LDA, MLB and Vectorizer...
Number of unique topics (ex. 'A.5' or 'H.3.5'): 358
Shape y_train: (100000, 358)
['A.0' 'A.1' 'A.2' 'A.m' 'B.0' 'B.1' 'B.1.0' 'B.1.1' 'B.1.2' 'B.1.3'
 'B.1.4' 'B.1.5' 'B.1.m' 'B.2' 'B.2.0' 'B.2.1' 'B.2.2' 'B.2.3' 'B.2.4'
 'B.2.m' 'B.3' 'B.3.0' 'B.3.1' 'B.3.2' 'B.3.3' 'B.3.4' 'B.3.m' 'B.4'
 'B.4.0' 'B.4.1' 'B.4.2' 'B.4.3' 'B.4.4' 'B.4.5' 'B.4.m' 'B.5' 'B.5.0'
 'B.5.1' 'B.5.2' 'B.5.3' 'B.5.m' 'B.6' 'B.6.0' 'B.6.1' 'B.6.2' 'B.6.3'
 'B.6.m' 'B.7' 'B.7.0' 'B.7.1' 'B.7.2' 'B.7.3' 'B.7.m' 'B.8' 'B.8.0'
 'B.8.1' 'B.8.2' 'B.8.m' 'B.m' 'C.0' 'C.1' 'C.1.0' 'C.1.1' 'C.1.2' 'C.1.3'
 'C.1.4' 'C.1.m' 'C.2' 'C.2.0' 'C.2.1' 'C.2.2' 'C.2.3' 'C.2.4' 'C.2.5'
 'C.2.6' 'C.2.m' 'C.3' 'C.4' 'C.5' 'C.5.0' 'C.5.1' 'C.5.2' 'C.5.3' 'C.5.4'
 'C.5.5' 'C.5.m' 'C.m' 'D.0' 'D.1' 'D.1.0' 'D.1.1' 'D.1.2' 'D.1.3' 'D.1.4'
 'D.1.5' 'D.1.6' 'D.1.7' 'D.1.m' 'D.2' 'D.2.0' 'D.2.1' 'D.2.10' 'D.2.11'
 'D.2.12' 'D.2.13' 'D.2.2' 'D.2.3' 'D.2.4' 'D.2.5' 'D.2.6' 'D.2.7' 'D.2.8'
 'D.2.9' 'D

# Check the topic distributions of the first 5 documents

In [8]:
file_ids = train["Textfile"].iloc[:5].values
topic_distributions = X_train[:5]

topics_df = pd.DataFrame(
    np.round(topic_distributions, 3),
    columns=[f"Topic {i}" for i in range(lda.n_components)],
    index=file_ids,
)
topics_df.T

|  | 580106.txt | 1755942.txt | 1416298.txt | 1516665.txt | 1259693.txt |
|---|---|---|---|---|---|
| Topic 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Topic 1 | 0.0 | 0.0 | 0.019 | 0.0 | 0.467 |
| Topic 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Topic 3 | 0.018 | 0.196 | 0.0 | 0.0 | 0.171 |
| Topic 4 | 0.0 | 0.0 | 0.0 | 0.149 | 0.0 |
| Topic 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Topic 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Topic 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.014 |
| Topic 8 | 0.0 | 0.01 | 0.3 | 0.0 | 0.0 |
| Topic 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.014 |


# Let's see the words assigned to different topics by LDA

In [9]:
feature_names = vectorizer.get_feature_names_out()

def show_top_words(model, feature_names, n_top_words=10):
    for topic_idx, topic in enumerate(model.components_):
        top_features = [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
        print(f"Topic {topic_idx}: {' '.join(top_features)}")

show_top_words(lda, feature_names)

Topic 0: design hardware architecture processor performance embedded system level instruction designs
Topic 1: user virtual system interface interaction interactive visual visualization display using
Topic 2: translation optical language system english machine word wavelength reputation using
Topic 3: security key protocol secure privacy access attacks based scheme protocols
Topic 4: semantic information domain knowledge based model ontology representation concepts medical
Topic 5: scheduling time cache performance parallel execution task tasks program analysis
Topic 6: coding layer high layers compression si surface temperature thin silicon
Topic 7: functions function linear set given space paper also polynomial one
Topic 8: network networks traffic packet internet end ip bandwidth video qos
Topic 9: algorithms methods matrix matrices algorithm sparse linear iterative problems decomposition
Topic 10: database query queries xml data databases relational processing schema sql
Topic 11: 

# Training classifier

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier


lr = LogisticRegression(
    max_iter=1000,
    solver="saga",
    n_jobs=-1,
)

# Train one binary classifier per label, then predict
clf = OneVsRestClassifier(lr, n_jobs=-1)
clf.fit(X_train, y_train)

y_pred_bin = clf.predict(X_test)
y_pred_labels = mlb.inverse_transform(y_pred_bin)



with open("models/classifier.pkl", "wb") as f:
    pickle.dump(clf, f)
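
A one-vs-rest classifier may predict no label at all for some documents, while the submission needs at least one label per line. One possible workaround, sketched below under stated assumptions, is to fall back to the single most probable label. The helper `labels_with_fallback` is hypothetical, the 0.5 threshold is an arbitrary choice, and the probabilities are assumed to come from something like `clf.predict_proba(X_test)` with the class names taken from `mlb.classes_`.

```python
def labels_with_fallback(probabilities, classes, threshold=0.5):
    """Pick every label whose probability reaches the threshold;
    if none does, keep the single most probable label instead."""
    results = []
    for row in probabilities:
        chosen = [c for c, p in zip(classes, row) if p >= threshold]
        if not chosen:
            best = max(range(len(row)), key=lambda i: row[i])
            chosen = [classes[best]]
        results.append(chosen)
    return results


# Made-up probabilities for two documents and three labels.
classes = ["A.0", "B.1", "H.3.5"]
probabilities = [
    [0.1, 0.7, 0.6],  # two labels pass the threshold
    [0.2, 0.3, 0.1],  # nothing passes, so the best label is used
]
print(labels_with_fallback(probabilities, classes))
```

For the example above this yields `[['B.1', 'H.3.5'], ['B.1']]`, so every document ends up with at least one label.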

# Test

In [None]:
# X_test = vectorizer.transform(test["Text"])
# lda_test_topic = lda.transform(X_test)


# predict_binary = clf.predict(lda_test_topic)
# predict_labels = mlb.inverse_transform(predict_binary)

# Report
- Example topics (keywords from LDA) DONE
- PCA
- precision/recall on validation data
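
For the precision/recall item above, one option is micro-averaging over a held-out part of the training data. This is a sketch only: `micro_precision_recall` is a hypothetical helper, and how the validation split is created is left open.

```python
def micro_precision_recall(all_predicted, all_true):
    """Micro-averaged precision and recall over per-document label sets."""
    tp = fp = fn = 0
    for predicted, true in zip(all_predicted, all_true):
        predicted, true = set(predicted), set(true)
        tp += len(predicted & true)   # labels predicted and correct
        fp += len(predicted - true)   # labels predicted but wrong
        fn += len(true - predicted)   # true labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Two made-up validation documents.
predicted = [["H.3.5", "D.3.2"], ["B.1"]]
true = [["H.3.5"], ["B.1", "B.2"]]
precision, recall = micro_precision_recall(predicted, true)
print(round(precision, 3), round(recall, 3))
```

Here two predicted labels are correct, one is wrong and one true label is missed, so both precision and recall equal 2/3.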
