# Paper Abstract Classification

In this project I fetched my own data from arXiv with the `arxiv` API, stored it in a local PostgreSQL database to save myself from `csv` files, and built a paper abstract classification web app.

The app looks like this:
![](assets/website.png)

When you choose an abstract to classify, the app takes you to a new page with the prediction.

To open the web app, first go to `backend/` and run

`python main.py`

Then, in another shell or terminal tab, go to `frontend/app/` and run

`uvicorn --host 0.0.0.0 --port 8080 main:app --reload`

Open the link that uvicorn prints and voilà!

Through the `arxiv` library, I pulled 8000 abstracts, 2000 from each of the `ai`, `cv`, `ml` and `ds` categories. I kept the dataset small to avoid compute problems while working, although I still had to train some models on Kaggle. Some difficulty in model training was inevitable, especially since each input is a long text.

Our categories:
- ai: artificial intelligence
- ml: machine learning
- cv: computer vision
- ds: data structures and algorithms

I created a Python file, `collect_and_store_data.py`, to fetch data from arXiv. Then I created a PostgreSQL database on my local machine and stored the data there as `train` and `test`. This way I don't have to deal with csv files and can use the data straight from the database. I also created `read_data.py` to read the data from the database into my notebook.

With the data in place, we can start working on it. To be honest, there is not much to inspect when dealing with text data. The relationship between an abstract and its category may be clear to us, but we need to make sure the model picks it up too. So in `notebooks/data_analysis.ipynb` we take a look at some basic information about the data. Since I fetched the data myself, I know that the classes are balanced.

You can see train and test set sizes:

![](assets/traintestsize.png)

The API returned the papers ordered by category: the first 1500 are `ai`, the next 1500 are `ml`, and so on. So I shuffled the data. For classical ML algorithms the order wouldn't be a problem, but for deep learning, which trains on mini-batches, shuffling is a must.
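A minimal sketch of that shuffling step, using only the standard library. The `rows` list and the fixed seed are assumptions for illustration; with a pandas DataFrame the same effect can be had with `df.sample(frac=1, random_state=42)`.

```python
import random

def shuffle_rows(rows, seed=42):
    """Return a shuffled copy of rows; the input list is left untouched."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Toy data ordered by category, like the API output (hypothetical rows).
rows = [("abstract %d" % i, label)
        for label in ("ai", "ml", "cv", "ds")
        for i in range(3)]
shuffled = shuffle_rows(rows)
```

Fixing the seed keeps the shuffle reproducible, so the same order comes back on every run.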

Here is a visualization of the class distribution:

![](assets/classdist.png)

Now, let's see the `wordcloud` of every category to observe which words are used the most in each.

![](assets/output.png)
![](assets/output2.png)
![](assets/output3.png)
![](assets/output4.png)


At least two words appear in every category:

* Algorithm
* Problem

Considering the theme of the topics, it is normal for these words to appear in every paper abstract. 

Most common words in ML:
- learning 2033
- data 1498
- algorithm 1218
- problem 858
- algorithms 837
- model 763

Most common words in AI:
- paper 951
- problem 772
- algorithm 690
- model 602
- approach 595
- problems 594

Most common words in CV:
- image 2039
- images 1358
- method 1155
- paper 966
- proposed 923
- based 894

Most common words in DS:
- algorithm 2290
- problem 2244
- n 1879
- time 1703
- algorithms 1102
- graph 948

In CV, `image` and `images` take the first two places. That's not ideal, since they mean the same thing. Hence we should apply lemmatization, or maybe stemming, to the text.

Again, we can see the word `algorithm` in each category, and `problem` as well. The rest are words related to each category. One thing to note is that in DS (Data Structures and Algorithms) you see `k`, `n` and `1`. My guess is that DS abstracts often use such letters to represent values, so seeing them is not that weird. The `1` might also be about dimensions. I bet there are 2s and 3s as well. Let's see.

`We consider the discrepancy problem of coloring $n$ intervals with $k$ colors\nsuch that at each point on the line, the maximal difference between the number\nof intervals of any two colors is minimal. Somewhat surprisingly, a coloring\nwith maximal difference at most one always exists. Furthermore, we give an\nalgorithm with running time $O(n \\log n + kn \\log k)$ for its construction.\nThis is in particular interesting because many known results for discrepancy\nproblems are non-constructive. This problem naturally models a load balancing\nscenario, where $n$ tasks with given start- and endtimes have to be distributed\namong $k$ servers. Our results imply that this can be done ideally balanced.\n  When generalizing to $d$-dimensional boxes (instead of intervals), a solution\nwith difference at most one is not always possible. We show that for any $d \\ge\n2$ and any $k \\ge 2$ it is NP-complete to decide if such a solution exists,\nwhich implies also NP-hardness of the respective minimization problem.\n  In an online scenario, where intervals arrive over time and the color has to\nbe decided upon arrival, the maximal difference in the size of color classes\ncan become arbitrarily high for any online algorithm.`

The `\n`s represent line breaks, so ignore them, but notice `(n \\log n + kn \\log k)`: the `n`s are part of mathematical expressions. So seeing `n` in the top 20 is normal, given that abstracts are likely to contain this kind of notation. My idea is to keep those letters, since they can help the model find relations for a specific category. There are digits too. Let's check what they represent.

Another example:

`Concept drift refers to a non stationary learning problem over time. The\ntraining and the application data often mismatch in real life problems. In this\nreport we present a context of concept drift problem 1. We focus on the issues\nrelevant to adaptive training set formation. We present the framework and\nterminology, and formulate a global picture of concept drift learners design.\nWe start with formalizing the framework for the concept drifting data in\nSection 1. In Section 2 we discuss the adaptivity mechanisms of the concept\ndrift learners. In Section 3 we overview the principle mechanisms of concept\ndrift learners. In this chapter we give a general picture of the available\nalgorithms and categorize them based on their properties. Section 5 discusses\nthe related research fields and Section 5 groups and presents major concept\ndrift applications. This report is intended to give a bird's view of concept\ndrift research field, provide a context of the research and position it within\nbroad spectrum of research fields and applications.`

Here, the number in `Section 2` only points to a section, so we cannot say it is something we need. But elsewhere a number does carry meaning:

`...factor less than 2...`

Simply removing all numbers because some of them only point to a section would also remove the numbers that say something about the topic.

To avoid losing such meaning, it is much healthier to keep the numbers in the text, since they can lead us to the category. We can, however, convert them to their text representations, like 2 -> two.

Abstracts can be long texts. So let's look at the number of tokens contained in each abstract by category for train set:

![](assets/output5.png)

For test set:

![](assets/output6.png)


The token counts are centered around 150 and look roughly normally distributed. A few abstracts in both train and test reach 400 tokens or more. There are an estimated 500 samples in the test set. Our data doesn't have huge outliers, which is a relief.

While the token distribution in train is almost the same for each category, in the test set the AI category stands out in its token counts.
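A quick numeric check can complement the plots. This is only a sketch: whitespace tokenization and the 400-token threshold are assumptions, and the texts are made up.

```python
from statistics import mean, median

def token_length_summary(texts, max_len=400):
    """Summarize whitespace-token counts and the share of texts above max_len."""
    lengths = [len(text.split()) for text in texts]
    over = sum(1 for n in lengths if n > max_len)
    return {
        "mean": mean(lengths),
        "median": median(lengths),
        "max": max(lengths),
        "share_over_max_len": over / len(lengths),
    }

# Hypothetical abstracts: three of typical length and one very long one.
texts = ["word " * 150, "word " * 120, "word " * 180, "word " * 450]
summary = token_length_summary(texts)
```

The share of texts above the threshold is useful later, when a maximum sequence length has to be chosen for the neural networks.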

I thought abstracts wouldn't have website links in them, but I decided to check, better safe than sorry. And I was right to check. There are not a lot of them, but we should get rid of them. We'll do that in another notebook.

`In recent years, quadratic weighted kappa has been growing in popularity in\nthe machine learning community as an evaluation metric in domains where the\ntarget labels to be predicted are drawn from integer ratings, usually obtained\nfrom human experts. For example, it was the metric of choice in several recent,\nhigh profile machine learning contests hosted on Kaggle :\nhttps://www.kaggle.com/c/asap-aes , https://www.kaggle.com/c/asap-sas ,\nhttps://www.kaggle.com/c/diabetic-retinopathy-detection . Yet, little is\nunderstood about the nature of this metric, its underlying mathematical\nproperties, where it fits among other common evaluation metrics such as mean\nsquared error (MSE) and correlation, or if it can be optimized analytically,\nand if so, how. Much of this is due to the cumbersome way that this metric is\ncommonly defined. In this paper we first derive an equivalent but much simpler,\nand more useful, definition for quadratic weighted kappa, and then employ this\nalternate form to address the above issues.`
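Finding such links takes one regular expression. The sketch below uses the `http\S+` pattern, which matches any run of non-space characters beginning with `http`; the sample string is a shortened piece of the abstract above.

```python
import re

URL_PATTERN = re.compile(r"http\S+")

def find_links(text):
    """Return every run of non-space characters that begins with 'http'."""
    return URL_PATTERN.findall(text)

sample = ("contests hosted on Kaggle :\n"
          "https://www.kaggle.com/c/asap-aes , https://www.kaggle.com/c/asap-sas ,")
links = find_links(sample)
```

Counting the abstracts for which `find_links` returns a non-empty list shows how widespread the problem is before anything is removed.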

Abstracts are also likely to contain accented Latin letters and other special characters, in equations or other notation. These caused errors while I was building the API, so we should remove them from our texts.

`In this paper, we study the two choice balls and bins process when balls are\nnot allowed to choose any two random bins, but only bins that are connected by\nan edge in an underlying graph. We show that for $n$ balls and $n$ bins, if the\ngraph is almost regular with degree $n^\\epsilon$, where $\\epsilon$ is not too\nsmall, the previous bounds on the maximum load continue to hold. Precisely, the\nmaximum load is $\\log \\log n + O(1/\\epsilon) + O(1)$. For general\n$\\Delta$-regular graphs, we show that the maximum load is $\\log\\log n +\nO(\\frac{\\log n}{\\log (\\Delta/\\log^4 n)}) + O(1)$ and also provide an almost\nmatching lower bound of $\\log \\log n + \\frac{\\log n}{\\log (\\Delta \\log n)}$.\n  V{\\"o}cking [Voc99] showed that the maximum bin size with $d$ choice load\nbalancing can be further improved to $O(\\log\\log n /d)$ by breaking ties to the\nleft. This requires $d$ random bin choices. We show that such bounds can be\nachieved by making only two random accesses and querying $d/2$ contiguous bins\nin each access. By grouping a sequence of $n$ bins into $2n/d$ groups, each of\n$d/2$ consecutive bins, if each ball chooses two groups at random and inserts\nthe new ball into the least-loaded bin in the lesser loaded group, then the\nmaximum load is $O(\\log\\log n/d)$ with high probability.`

As for what our model might need: such representations are still on the math side, but they may not carry as much meaning about the operations as numbers or letters do. Of course, we cannot say that they are absolutely useless, but we have to remove them to avoid problems later on.

## Model Time

So let's set the steps:
- Lowercase the text.
- Remove equations.
- Remove website links.
- Convert digits to text.
- Remove stopwords.
- Remove accents from Latin characters.
- Remove punctuation.
- Apply stemming and lemmatization.

I am not sure about lemmatization yet. But since lemmatization takes the meaning of a word into account instead of cutting it abruptly like stemming does, I've written a function for it as well.

```python
import re
import string
import unicodedata

import inflect
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Requires the NLTK tokenizer, stopword and WordNet data (see nltk.download).

def remove_equations(text):
    # Remove inline LaTeX equations delimited by $...$
    # (re.DOTALL lets a match span the line breaks inside abstracts)
    text = re.sub(r'\$.*?\$', '', text, flags=re.DOTALL)
    return text

def remove_punctuation(text):
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text

def to_lowercase(text):
    # Convert to lowercase
    text = text.lower()
    return text

def remove_accents(text):
    # Strip accents from Latin letters and drop any remaining non-ASCII characters
    text = unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode('utf-8', 'ignore')
    return text

def tokenize_and_remove_stopwords(text):
    # Tokenize the text
    words = nltk.word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]

    # Join the processed words back into a sentence
    processed_text = ' '.join(words)
    return processed_text

def lemmatize_text(text):
    # Lemmatize the text
    lemmatizer = WordNetLemmatizer()
    words = nltk.word_tokenize(text)
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

    # Join the lemmatized words back into a sentence
    processed_text = ' '.join(lemmatized_words)
    return processed_text

def number_to_text(text):
    # Convert standalone digit tokens to their text representation
    p = inflect.engine()
    words = []
    for word in text.split():
        if word.isdigit():
            words.append(p.number_to_words(word))
        else:
            words.append(word)
    processed_text = ' '.join(words)
    return processed_text

def stem_text(text):
    # Tokenize the text into words
    words = nltk.word_tokenize(text)

    # Initialize the Porter stemmer
    stemmer = PorterStemmer()

    # Stem each word in the text
    stemmed_words = [stemmer.stem(word) for word in words]

    # Join the stemmed words back into a sentence
    processed_text = ' '.join(stemmed_words)

    return processed_text

def remove_website_links(text):
    # Remove every run of non-space characters that begins with "http"
    processed_text = re.sub(r'http\S+', '', text, flags=re.MULTILINE)

    return processed_text
```
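The order of these steps matters. `$` is itself a punctuation character, so if punctuation is stripped first, the `\$.*?\$` pattern has nothing left to match and the equation body leaks into the text. Below is a stdlib-only sketch with simplified versions of two of the steps. The wrappers `equations_first` and `punctuation_first` are hypothetical names, and `re.DOTALL` is added so that equations spanning a line break are matched as well.

```python
import re
import string

def remove_equations(text):
    # Drop inline LaTeX spans such as $O(n \log n)$.
    return re.sub(r"\$.*?\$", "", text, flags=re.DOTALL)

def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def equations_first(text):
    # Equations are removed while their $ delimiters are still present.
    cleaned = remove_punctuation(remove_equations(text.lower()))
    return " ".join(cleaned.split())

def punctuation_first(text):
    # The $ delimiters are stripped first, so the equation body survives.
    cleaned = remove_equations(remove_punctuation(text.lower()))
    return " ".join(cleaned.split())

sample = "An algorithm with running time $O(n \\log n)$ is given."
good = equations_first(sample)
bad = punctuation_first(sample)
```

With the wrong order, fragments such as `on log n` remain in the text and end up as noise tokens in the vocabulary.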


An example abstract:

`We consider the classic budgeted maximum weight independent set (BMWIS) problem. The input is a graph G=(V,E), a weight function w:V→ℝ≥0, a cost function c:V→ℝ≥0, and a budget B∈ℝ≥0. The goal is to find an independent set S⊆V in G such that ∑v∈Sc(v)≤B, which maximizes the total weight ∑v∈Sw(v). Since the problem on general graphs cannot be approximated within ratio |V|1−ε for any ε>0, BMWIS has attracted significant attention on graph families for which a maximum weight independent set can be computed in polynomial time. Two notable such graph families are bipartite and perfect graphs. BMWIS is known to be NP-hard on both of these graph families; however, the best possible approximation guarantees for these graphs are wide open.
In this paper, we give a tight 2-approximation for BMWIS on perfect graphs and bipartite graphs. In particular, we give We a (2−ε) lower bound for BMWIS on bipartite graphs, already for the special case where the budget is replaced by a cardinality constraint, based on the Small Set Expansion Hypothesis (SSEH). For the upper bound, we design a 2-approximation for BMWIS on perfect graphs using a Lagrangian relaxation based technique. Finally, we obtain a tight lower bound for the capacitated maximum weight independent set (CMWIS) problem, the special case of BMWIS where w(v)=c(v) ∀v∈V. We show that CMWIS on bipartite and perfect graphs is unlikely to admit an efficient polynomial-time approximation scheme (EPTAS). Thus, the existing PTAS for CMWIS is essentially the best we can expect.`

You can see symbols such as ∑, ∈, ε, ∀ and ⊆. FastAPI could not process these symbols in my app, so we remove them from the input before prediction.
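To see what happens to such symbols, here is a small sketch of the NFKD-plus-ASCII approach, written with the standard library only. The example strings are picked for illustration.

```python
import unicodedata

def remove_accents(text):
    # NFKD splits characters into a base plus combining marks where a
    # decomposition exists; encoding to ASCII with errors="ignore" then
    # drops everything that is still non-ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

examples = {s: remove_accents(s) for s in ["Vöcking", "ℝ", "∑", "S⊆V"]}
```

Accented letters keep their base letter (`Vöcking` becomes `Vocking`) and `ℝ` decomposes to a plain `R`, while symbols without a decomposition, such as `∑` and `⊆`, disappear entirely. That is why fragments like `wvR0` and `sv` show up in the preprocessed text.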

After the preprocessing steps:

`consider classic budgeted maximum weight independent set bmwis problem input graph gve weight function wvR0 cost function cvR0 budget bR0 goal find independent set sv g vscvb maximizes total weight vswv since problem general graph approximated within ratio v1 zero bmwis attracted significant attention graph family maximum weight independent set computed polynomial time two notable graph family bipartite perfect graph bmwis known nphard graph family however best possible approximation guarantee graph wide open paper give tight 2approximation bmwis perfect graph bipartite graph particular give two lower bound bmwis bipartite graph already special case budget replaced cardinality constraint based small set expansion hypothesis sseh upper bound design 2approximation bmwis perfect graph using lagrangian relaxation based technique finally obtain tight lower bound capacitated maximum weight independent set cmwis problem special case bmwis wvcv vv show cmwis bipartite perfect graph unlikely admit efficient polynomialtime approximation scheme eptas thus existing ptas cmwis essentially best expect`

I used these algorithms for the ML part:

`LogisticRegression(),
KNeighborsClassifier(),
RandomForestClassifier(),
XGBClassifier(),
DecisionTreeClassifier(),
SVC()`

To decide, I used the `plotly` library to visualize `recall`, `precision`, `accuracy` and `f1 score`, and chose the two best-performing models.

I usually look at the F1 score for ML algorithms, since it is the harmonic mean of recall and precision. It is more informative than accuracy alone.
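As a reminder of how the harmonic mean behaves, here is a small worked example for a single class. The counts are hypothetical, not taken from the results.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 90 correct, 10 false alarms, 30 misses.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

Here precision is 0.90 and recall is 0.75. Their arithmetic mean would be 0.825, but F1 comes out at about 0.818, because the harmonic mean is pulled towards the weaker of the two values.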

Classical machine learning algorithms can be more successful than neural networks when there is little data, as is the case here. It's a good idea to try them first before moving on to larger models. This was the most important lesson for this task in particular. As I will point out shortly, NNs are very good at learning our train data, so I ran into an overfitting problem.

These are the results I got:

![](assets/newplot1.png)
![](assets/newplot2.png)
![](assets/newplot3.png)
![](assets/newplot4.png)
![](assets/newplot5.png)
![](assets/newplot6.png)


Then I ran a grid search for the best hyperparameters of Logistic Regression and SVC to decide which of the two is better.

`SVC F1 Score -> 0.90889`

`Logistic Regression -> 0.90828`

SVC results with best hyperparameters:

![](assets/newplot7.png)

With grid search we found the best hyperparameters for our model and increased the score by 1%.

Confusion Matrix:

![](assets/output7.png)

The model's main weakness is labeling CV papers as ML papers, followed by 36 DS papers mistaken for AI papers. The category with the lowest error rate is AI.
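The per-category error rate can be read off a confusion matrix by dividing the off-diagonal counts of a row by the row total. A minimal stdlib sketch follows; the labels are the four categories of this project, but the predictions are made up and are not the real results.

```python
from collections import Counter

LABELS = ["ai", "ml", "cv", "ds"]

def confusion_counts(y_true, y_pred):
    """Count (true, predicted) label pairs."""
    return Counter(zip(y_true, y_pred))

def per_class_error_rate(counts, labels=LABELS):
    """Share of each true class that was given a different label."""
    rates = {}
    for true in labels:
        total = sum(counts[(true, pred)] for pred in labels)
        wrong = total - counts[(true, true)]
        rates[true] = wrong / total if total else 0.0
    return rates

# Hypothetical predictions, not the real results from the plots.
y_true = ["cv", "cv", "cv", "cv", "ds", "ds", "ai", "ml"]
y_pred = ["cv", "ml", "cv", "cv", "ai", "ds", "ai", "ml"]
counts = confusion_counts(y_true, y_pred)
rates = per_class_error_rate(counts)
```

In this toy data one of four CV papers is labeled ML and one of two DS papers is labeled AI, which mirrors the kind of confusion described above.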

Classification report:

![](assets/output8.png)

Logistic regression results with best hyperparameters:

![](assets/newplot8.png)

Confusion matrix:

![](assets/output9.png)

Classification report:

![](assets/output10.png)

The models have almost the same scores. Looking at the darkest blues in the confusion matrices, SVC's precision seems higher than logistic regression's.

Category-based errors are slightly higher in logistic regression. For example, the error rate in the `ds` category is higher for logistic regression, but in the results we clearly see that the model mixes `cv` papers with `ml` papers and `ds` papers with `ai` papers. To be honest, the differences are very small, but I need to choose the best one to compare with the neural network model.

## Neural Network

We had good luck with the models on the ML side: the last two perform well enough. As I mentioned at the beginning, we have little data and we are trying to make predictions about a wide-ranging subject. In such cases, neural networks should not be the first thing to reach for, like I did.

I applied the same preprocessing and tried various networks. I changed hyperparameters such as the embedding dimension, the number of neurons, the number of epochs, etc. I encountered overfitting in every result. Finally, I tried to enlarge the train set by moving a thousand samples from the test set to the train set, but the result was the same.

Let's break it down.

I set the vocabulary size and the maximum sequence length so that every input is padded to the same length.

`MAX_NB_WORDS = 5000`

`MAX_SEQUENCE_LENGTH = 400`

Tokenize and pad the sequences:


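```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Note: MAX_NB_WORDS is not passed to Tokenizer here, so the full vocabulary is kept.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train.stemmed_text)

word_index = tokenizer.word_index
vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding index 0
print("Vocabulary Size :", vocab_size)

x_train = pad_sequences(tokenizer.texts_to_sequences(train.stemmed_text),
                        maxlen=MAX_SEQUENCE_LENGTH)
x_test = pad_sequences(tokenizer.texts_to_sequences(test.stemmed_text),
                       maxlen=MAX_SEQUENCE_LENGTH)
```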
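By default, Keras `pad_sequences` pads and truncates at the front of a sequence (`padding='pre'`, `truncating='pre'`). The stdlib sketch below mimics that default behaviour on plain lists to make it visible. `pad_pre` is a hypothetical helper, not a Keras function.

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic the Keras defaults: truncate and pad at the front."""
    truncated = sequence[-maxlen:]  # keep the last maxlen tokens
    padding = [value] * (maxlen - len(truncated))
    return padding + truncated

short = pad_pre([5, 8, 2], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
```

So with `MAX_SEQUENCE_LENGTH = 400`, an abstract longer than 400 tokens loses its first tokens, while shorter ones get leading zeros.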

One-hot encode the targets:

`y_train = tf.one_hot(train["target"], depth=len(mapping))`

`y_test = tf.one_hot(test["target"], depth=len(mapping))`

First network:

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Dropout
from tensorflow.keras.models import Model

inputs = Input(name='inputs', shape=[MAX_SEQUENCE_LENGTH])
layer = Embedding(vocab_size, 50, input_length=MAX_SEQUENCE_LENGTH)(inputs)
layer = LSTM(64)(layer)
layer = Dense(256, activation="relu", name='FC1')(layer)
layer = Dropout(0.5)(layer)
layer = Dense(4, activation="softmax")(layer)
model = Model(inputs=inputs, outputs=layer)
```

I used f1 score and accuracy as metrics and Adam as optimizer.

On the training set everything was great:

![](assets/output11.png)

Looks too good to be true till we see the evaluation on the test set:

![](assets/output12.png)

This model is not generalizing to unseen data. That's why the results on the test set are poor.

I increased the `embedding_dim` from 50 to 200:



```python
inputs = Input(name='inputs', shape=[MAX_SEQUENCE_LENGTH])
layer = Embedding(vocab_size, 200, input_length=MAX_SEQUENCE_LENGTH)(inputs)
layer = LSTM(64)(layer)
layer = Dense(256, activation="relu", name='FC1')(layer)
layer = Dropout(0.5)(layer)
layer = Dense(4, activation="softmax")(layer)
model = Model(inputs=inputs, outputs=layer)
```

The result is no different, with heavy overfitting again:

![](assets/output13.png)

Evaluate:

![](assets/output14.png)

I added `Dropout` layers to prevent overfitting, but it didn't work.

Then I thought this network might be too big for such a small dataset.

I tried the following network, but had to train it on Kaggle since it would not run on my local machine:



```python
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential

model = Sequential()

embedding_dim = 100
model.add(Embedding(vocab_size, embedding_dim, input_length=MAX_SEQUENCE_LENGTH))

model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))

model.add(Dense(4, activation='softmax'))

metric = tf.metrics.CategoricalAccuracy()
opt = tf.keras.optimizers.legacy.Adam()
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy', metric])
```

![](assets/download.png)
![](assets/download-1.png)

52 ML papers were mistaken for CV papers by the model.

As you can see, the F1 score is 91%, while the model's training accuracy was about 98%. The F1 score and the confusion matrix might look great, but compared with the train set score they are not that good. It's not something we would pick as the best case. We could work on it to get better results, but it is still not what we need.

The confusion matrix looks the same as the SVC one, because both models perform almost the same on the test set.

I also tried pretrained GloVe embeddings, which contain 400,000 word vectors. By using GloVe embeddings as features for the model, I aimed to benefit from the rich information encoded in them and potentially improve the model's performance.




```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.layers import (Bidirectional, Conv1D, Dense, Dropout,
                                     Input, LSTM, SpatialDropout1D)
from tensorflow.keras.optimizers import Adam

GLOVE_EMB = '/kaggle/input/glove-embeddings/glove.6B.300d.txt'
EMBEDDING_DIM = 300
LR = 1e-3
BATCH_SIZE = 32
EPOCHS = 40

# Load the pretrained GloVe vectors into a word -> vector dictionary
embeddings_index = {}
with open(GLOVE_EMB, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

# Rows of words that have no GloVe vector stay all-zero
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

embedding_layer = tf.keras.layers.Embedding(vocab_size,
                                            EMBEDDING_DIM,
                                            weights=[embedding_matrix],
                                            input_length=MAX_SEQUENCE_LENGTH,
                                            trainable=False)

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedding_sequences = embedding_layer(sequence_input)
x = SpatialDropout1D(0.2)(embedding_sequences)
x = Conv1D(64, 5, activation='relu')(x)
x = Bidirectional(LSTM(64, dropout=0.2, recurrent_dropout=0.2))(x)
x = Dense(512, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(512, activation='relu')(x)
outputs = Dense(4, activation='softmax')(x)
model = tf.keras.Model(sequence_input, outputs)

metr = tf.metrics.CategoricalAccuracy()
model.compile(optimizer=Adam(learning_rate=LR), loss='categorical_crossentropy',
              metrics=['accuracy', metr])

# Create EarlyStopping callback
early_stopping_callback = EarlyStopping(
    monitor='val_loss',    # Metric to monitor for early stopping
    patience=5             # Stop after 5 epochs without improvement
)

# Create ModelCheckpoint callback
model_checkpoint_callback = ModelCheckpoint(
    filepath='best_model',      # Filepath to save the best model
    monitor='val_loss',         # Metric to monitor for saving the best model
    save_best_only=True,        # Save only the best model based on the monitored metric
    save_weights_only=False,    # Save the entire model, including architecture and weights
    verbose=1                   # Print a message each time the model is saved
)

y_train = tf.one_hot(train["target"], depth=len(mapping))
y_test = tf.one_hot(test["target"], depth=len(mapping))

# Note: the test set doubles as validation data here, so early stopping
# and the checkpoint choice are tuned on it.
history = model.fit(x_train, y_train, batch_size=BATCH_SIZE, epochs=EPOCHS,
                    validation_data=(x_test, y_test),
                    callbacks=[early_stopping_callback, model_checkpoint_callback])
```
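The `patience=5` setting means that training stops once `val_loss` has failed to improve for five epochs in a row. Below is a simplified pure-Python sketch of that rule. It is not the Keras implementation, and the loss values are invented.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0  # improvement: reset the counter
        else:
            wait += 1             # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: best value at epoch 2, then no improvement.
losses = [0.90, 0.70, 0.60, 0.65, 0.66, 0.70, 0.72, 0.80, 0.85]
stop_epoch = early_stopping_epoch(losses, patience=5)
```

In this example the best loss is reached at epoch 2 and training stops at epoch 7. Combined with `save_best_only=True`, the checkpoint on disk still holds the epoch 2 model.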

The results:

![](assets/glove_results4.png)

The model reaches an accuracy of 0.9223 in training but 0.8715 on the test set.


![](assets/glove_results.png)

In the confusion matrix, the error rate for CV papers is very high: 52 papers were classified as ML. For AI, 51 papers were classified as DS. Those error rates are a bit high.

![](assets/glove_results2.png)

![](assets/glove_results3.png)

Then, I increased `Dropout` to 0.6.

```python
inputs = Input(name='inputs', shape=[MAX_SEQUENCE_LENGTH])
layer = Embedding(vocab_size, 50, input_length=MAX_SEQUENCE_LENGTH)(inputs)
layer = LSTM(64)(layer)
layer = Dense(256, activation="relu", name='FC1')(layer)
layer = Dropout(0.6)(layer)
layer = Dense(128, activation="relu", name='FC2')(layer)
layer = Dropout(0.6)(layer)
layer = Dense(4, activation="softmax")(layer)
model = Model(inputs=inputs, outputs=layer)
```

On test set:

> loss: 1.0026 - accuracy: 0.8155 - f1_score: 0.8162

The number of correct predictions also decreased:

![](assets/output15.png)

I increased the train data by adding 1000 entries from the test set and tried the same networks. Not much changed. The number of correct DS predictions is noticeably lower than for the other categories, even allowing for the smaller test set, and the DS error rate is also large compared to the others: 50 samples were predicted as AI.

![](assets/increasedata.png)

![](assets/increasedata2.png)

Lastly, since nothing had changed, I added `BatchNormalization` to the network trained on the original data split. Correct predictions increased, but so did the errors.

![](assets/increasedropout.png)


Look at the bottom-left corner: ML and DS papers are mistaken for AI papers, and those counts are not small.

```python
# The network with BatchNormalization

model = Sequential()

embedding_dim = 100
model.add(Embedding(vocab_size, embedding_dim, input_length=MAX_SEQUENCE_LENGTH))

model.add(LSTM(64, dropout=0.5, recurrent_dropout=0.2))

model.add(tf.keras.layers.BatchNormalization())

model.add(Dense(32, activation="relu"))

model.add(Dropout(.5))

model.add(Dense(4, activation='softmax'))

metric = tf.metrics.CategoricalAccuracy()
opt = tf.keras.optimizers.legacy.Adam()
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy', metric])
```

In the end, I tried quite a few things while trying not to make the network too big. As it turned out, growing the model could have made things much worse, since the model shifts from understanding the data to memorizing it.

I added layers such as BatchNormalization and Dropout and played with hyperparameters. I ran more of these experiments, but kept forgetting to save the results along the way 🫠

With more data, I think we could get a nice result on this problem with GloVe embeddings, but for now GloVe embeddings failed as well.
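One thing worth checking here is vocabulary coverage. This is a guess, not something measured in this project: the tokenizer is fit on `stemmed_text`, and Porter stems such as `imag` or `comput` are not dictionary words, so they may be missing from GloVe and end up as all-zero rows in `embedding_matrix`. A small sketch of such a check follows, with toy dictionaries standing in for `word_index` and `embeddings_index`.

```python
def embedding_coverage(word_index, embeddings_index):
    """Return the share of vocabulary words found in the embeddings, plus the misses."""
    missing = sorted(w for w in word_index if w not in embeddings_index)
    covered = len(word_index) - len(missing)
    share = covered / len(word_index) if word_index else 0.0
    return share, missing

# Hypothetical toy data: stems such as "imag" are not real words,
# so they would have no pretrained vector.
word_index = {"algorithm": 1, "graph": 2, "imag": 3, "comput": 4}
embeddings_index = {"algorithm": [0.1], "graph": [0.2], "image": [0.3], "compute": [0.4]}
share, missing = embedding_coverage(word_index, embeddings_index)
```

If the coverage on the real vocabulary turned out to be low, fitting the tokenizer on lemmatized or unstemmed text would be a natural next experiment.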

Paper abstracts can contain many mathematical operations, and we removed some components of these operations from the texts, such as punctuation and operators. Such removals may have damaged the semantic integrity of some texts. Increasing the amount of data could also help the model generalize in this respect.

Finally, I decided to move forward with SVC, based on the scores and the distribution of the predictions. Most of the papers I have submitted on the site so far were classified correctly, but of course I ran into a few mistakes. Let me talk a little bit about them.

This example input belongs to the DS category:

`Recently Chen and Poor initiated the study of learning mixtures of linear dynamical systems. While linear dynamical systems already have wide-ranging applications in modeling time-series data, using mixture models can lead to a better fit or even a richer understanding of underlying subpopulations represented in the data. In this work we give a new approach to learning mixtures of linear dynamical systems that is based on tensor decompositions. As a result, our algorithm succeeds without strong separation conditions on the components, and can be used to compete with the Bayes optimal clustering of the trajectories. Moreover our algorithm works in the challenging partially-observed setting. Our starting point is the simple but powerful observation that the classic Ho-Kalman algorithm is a close relative of modern tensor decomposition methods for learning latent variable models. This gives us a playbook for how to extend it to work with more complicated generative models.`

But the model says it is `ML`. In my opinion, these errors occur in texts containing words such as `learning`, `linear` and `data`, which are also very common in the ML field. Normally, DS abstracts in our data contain distinctive words; I listed the most frequent ones above. When an abstract contains vocabulary that is closer to another category than to its own, it can lead to these misclassifications.

More examples of the same issue:

`The problem of continual learning in the domain of reinforcement learning, often called non-stationary reinforcement learning, has been identified as an important challenge to the application of reinforcement learning. We prove a worst-case complexity result, which we believe captures this challenge: Modifying the probabilities or the reward of a single state-action pair in a reinforcement learning problem requires an amount of time almost as large as the number of states in order to keep the value function up to date, unless the strong exponential time hypothesis (SETH) is false; SETH is a widely accepted strengthening of the P ≠ NP conjecture. Recall that the number of states in current applications of reinforcement learning is typically astronomical. In contrast, we show that just adding a new state-action pair is considerably easier to implement.`

`We study the problem of regret minimization for a single bidder in a sequence of first-price auctions where the bidder knows the item's value only if the auction is won. Our main contribution is a complete characterization, up to logarithmic factors, of the minimax regret in terms of the auction's transparency, which regulates the amount of information on competing bids disclosed by the auctioneer at the end of each auction. Our results hold under different assumptions (stochastic, adversarial, and their smoothed variants) on the environment generating the bidder's valuations and competing bids. These minimax rates reveal how the interplay between transparency and the nature of the environment affects how fast one can learn to bid optimally in first-price auctions.`

From what I have observed, the model usually assigns the ML label to abstracts that contain no formulas or numeric values and do contain the words I mentioned above. But it gives the correct DS label to papers containing words that are frequent in our data, such as `graph` and `time`. This suggests that most of the DS-labeled abstracts in our train set contain such keywords a lot.

I did not encounter many errors in other categories in my experiments. DS was the most conspicuous one. If I looked harder I could surely find more, since the models can still be improved, as I mentioned.

I hope the report was self-explanatory 🤗