In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import json
import re


import unidecode #for removing accents from strings.


import os

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import remove_stopwords

from nltk.stem.porter import PorterStemmer


# Overview of the Data


The data we use for training our model are from two sources. 

The first is Arxiv.org's datadump on Kaggle. Arxiv is an online scientific preprint repository, and the dump contains the metadata for 1.7 million preprints uploaded to the site, including their titles, categories, authors, and upload dates. I labeled each category as either pure or applied and use this as one dataset to train my model on. 

Additionally, I scraped around 400,000 titles and their categories from 2001 to 2006 from Mathscinet. Mathscinet classifies papers using 61 supercategories, each with several subcategories for a total of 5000+ different categorical labels. Again, I labeled each supercategory as either being pure or applied. Below I discuss the labeling in more detail. 

We will use the Arxiv dataset for training Word2Vec embeddings because it also contains abstracts (which supply much more text) and vocabulary from areas outside of math, which will be important for classifying papers as pure/applied (indeed, earlier experiments that trained Word2Vec only on math abstracts from Arxiv yielded worse performance). The dimension of the embeddings is 256 (I tried 512 before, but using PCA one could see the explained variance plateau a bit after 200 components).
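The explained-variance check mentioned above can be sketched as follows. This is only an illustration on synthetic data, not the experiment that was actually run: `toy_vectors` is a made-up stand-in for the embedding matrix, and `cumulative_explained_variance` is a hypothetical helper name.

```python
import numpy as np

def cumulative_explained_variance(vectors):
    """Return the cumulative fraction of variance explained by each principal component."""
    centered = vectors - vectors.mean(axis=0)
    # The squared singular values of the centered data are proportional to the PCA variances.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return np.cumsum(variances) / variances.sum()

# Synthetic stand-in for an embedding matrix (an assumption for this sketch):
# 1000 "words" in 64 dimensions whose variance mostly lives in a 10-dimensional subspace.
rng = np.random.default_rng(0)
signal = rng.normal(size=(1000, 10)) @ rng.normal(size=(10, 64))
noise = 0.01 * rng.normal(size=(1000, 64))
toy_vectors = signal + noise

curve = cumulative_explained_variance(toy_vectors)
# Smallest number of components whose cumulative share reaches 99%.
components_for_99 = int(np.searchsorted(curve, 0.99) + 1)
print(components_for_99)
```

If the curve flattens well before the full dimension, as it does here by construction, a smaller embedding size loses little variance.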

In another notebook, we will train neural networks with the same architectures separately on the labeled arxiv titles and the labeled mathscinet titles. We keep them separate since, as the categorization rules are different between the two sites, we can't guarantee that our labeling of pure/applied will be consistent. 

Below we show how we cleaned and prepared the data and how we learned the Word2Vec embeddings. 

A word of warning: this takes a lot of time. If you want to run this or something similar yourself, let the whole notebook run overnight.

We start first with the Arxiv dataset.


# Labelling Arxiv titles


The main categories for publications used by Arxiv are:

Computer Science
Physics
Mathematics
Economics
Quantitative Finance
Quantitative Biology
Statistics
Electrical Engineering and Systems Science

Within mathematics, there are several subcategories:



In [2]:
pure = ["math." + cat for cat in ["AC","AG","AP","AT","CA","CO","CT","CV","DG","DS","FA","GN","GR","GT","HO","KT",
                                  "LO","MG","MP","NT","OA","QA","RA","RT","SG","SP"]] + ["math-ph"]


applied = ["math." + cat for cat in ["IT","NA","OC","OR","ST","PR"]]

arxiv_math_categories = pure + applied


# Names are listed in the same order as arxiv_math_categories (pure first, then applied).
names = [
    "Commutative Algebra",
    "Algebraic Geometry",
    "Analysis of PDEs",
    "Algebraic Topology",
    "Classical Analysis and ODEs",
    "Combinatorics",
    "Category Theory",
    "Complex Variables",
    "Differential Geometry",
    "Dynamical Systems",
    "Functional Analysis",
    "General Topology",
    "Group Theory",
    "Geometric Topology",
    "History and Overview",
    "K-Theory and Homology",
    "Logic",
    "Metric Geometry",
    "Mathematical Physics",
    "Number Theory",
    "Operator Algebras",
    "Quantum Algebra",
    "Rings and Algebras",
    "Representation Theory",
    "Symplectic Geometry",
    "Spectral Theory",
    "Also Math Physics",  # math-ph
    "Information Theory",  # math.IT
    "Numerical Analysis",
    "Optimization and Control",
    "Operations Research",
    "Statistics Theory",  # math.ST
    "Probability"]  # math.PR

category_df = pd.DataFrame({"Category": names, "Abbreviation": arxiv_math_categories})
category_df

Unnamed: 0,Category,Abbreviation
0,Commutative Algebra,math.AC
1,Algebraic Geometry,math.AG
2,Analysis of PDEs,math.AP
3,Algebraic Topology,math.AT
4,Classical Analysis and ODEs,math.CA
5,Combinatorics,math.CO
6,Category Theory,math.CT
7,Complex Variables,math.CV
8,Differential Geometry,math.DG
9,Dynamical Systems,math.DS


There is also a General Math (math.GM) category, but since those topics could be of any category, we omit them.

Our focus will be on mathematics papers since that's what I'm most familiar with. We will consider two different classifiers: one is a multilabel classifier that tries to identify which of the above math categories a paper belongs to, and the other tries to classify papers as pure or applied. By an "applied" math paper, I mean one in Statistics, Probability, Optimization and Control, Numerical Analysis, or Information Theory (math.IT). 

My reasoning for this labelling is as follows: statistics and computer programming are highly sought-after skills, so Statistics, Numerical Analysis, and IT should be included. Depending on the kind of probability you do, it might be more "pure" or "applied" in the traditional sense, but I put it in the "applied" category since finance jobs seek people who understand concepts like Brownian motion and stochastic PDEs. Searching a job site such as Monster.com shows that there are jobs asking for people who know control theory or optimization, so I consider this category to be applied as well. While mathematical physics would seem to be an applied subject, it is more on the theoretical side.

There are a few weaknesses in this labeling scheme, listed below, that perhaps a future project could improve upon.

First, just because a paper is in one of the other fields does not mean that the authors don't know or use programming. I've known people to use programming to study pure mathematics problems in algebra and PDEs. However, someone graduating with a PhD in one of these fields is not guaranteed to have demonstrated in their thesis work that they can program. 

There are subjects where there is more crossover: some papers in combinatorics are crosslisted in computer science and study algorithms or discrete geometry. My decision to label this as "pure" comes from googling the question "I did a PhD in combinatorics and want to switch to industry" and seeing what discussions arise. It seems that most combinatorics is still quite abstract, and there are students asking this question who are advised that, while having a math PhD is desirable and combinatorics sure helps in programming, they should supplement this with learning to program. 

Another weakness in the classification is that I classified any paper categorized as a non-math topic as "applied". This was initially based on the assumption that working in physics implies some numerical or statistical ability. However, not all physics students specialize in experimental physics (there is also theoretical/mathematical physics), and I have seen articles online by physicists charting their transition from research to data science, which suggests that moving out of physics is not 100% straightforward. Still, I labeled physics as "applied" since I do not have the knowledge to sift through the subcategories and distinguish which are more employable than others.


# Cleaning the Arxiv metadata: all about the $

The Arxiv metadata comes as a json file. We convert it to some csv files for convenience with some additional features:

First, we add a separate column for whether the paper is "pure" or "applied" by designating some math categories as "pure" and labelling everything else as "applied". 

Furthermore, we create a new column for the primary category: for each item, we look at the "categories" field and select the first listed label. 

At the same time, we will make separate csv files consisting only of preprocessed titles and abstracts (and separate ones for those that are math papers), along with their categories and the pure/applied label. 
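The primary-category and pure/applied columns described above amount to the following rule. This is a minimal sketch: `label_article` is a hypothetical helper and `PURE_EXAMPLE` is an illustrative subset, not the full list of pure categories used in this notebook.

```python
# Illustrative subset only (an assumption for this sketch); the notebook's `pure` list is longer.
PURE_EXAMPLE = {"math.AG", "math.NT", "math-ph"}

def label_article(categories, pure=PURE_EXAMPLE):
    """Return (primary_category, is_pure) for a space-separated category string."""
    # The primary category is the first label in the "categories" field.
    primary = categories.split(" ")[0]
    # Only the primary category decides the label; cross-listings are ignored.
    return primary, int(primary in pure)

print(label_article("math.NT math.AG"))
print(label_article("stat.ML math.NT"))
```

Note that under this rule a paper with primary category stat.ML that is cross-listed in math.NT still counts as applied.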

The preprocessing involves some usual text cleaning methods but also some that are particular to scientific papers. 

Firstly, many titles and abstracts use LaTeX commands. Instead of trying to treat the math written in LaTeX as words, we will just filter it out of the text. Math is enclosed in dollar signs, so this is easy. 

Another issue is accents: some people write these using an actual Unicode character like ö, while others use the LaTeX equivalent, which is \"o (without dollar signs). We need to filter these out. Luckily, an accent command is usually a backslash followed by one other character (some exceptions are \o{} for ø or {\i} for the dotless ı, but we tolerate those to save on computational cost), so we replace each such pair with the empty string (after replacing newlines \n with a space).

Apart from that, it is the usual stuff: removing stopwords, deaccenting, stemming, etc. 

Experiments earlier that just removed all symbols still performed well, but the extra processing improved performance by several percentage points. 
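To make the LaTeX-specific steps concrete, here is a small sketch that uses only the standard `re` module and leaves out stopword removal, deaccenting, and stemming. `strip_latex` is a hypothetical name used for illustration, and the sample strings are made up.

```python
import re

def strip_latex(text):
    """Remove $...$ math and backslash accent commands from a title or abstract."""
    text = re.sub(r"\$.+?\$", "", text, flags=re.DOTALL)  # drop inline math, even across newlines
    text = text.replace("\n", " ")                        # newlines become spaces
    text = re.sub(r"\\+.", "", text)                      # drop a run of backslashes plus the next character
    text = re.sub(r"[{}]+", "", text)                     # drop leftover braces
    return text

print(strip_latex(r"Schr\"odinger operators"))   # prints: Schrodinger operators
print(strip_latex(r"Carath\'{e}odory metrics on $\mathbb{R}^n$"))
# A known exception: \o{} loses its letter entirely, so this gives "Srensen".
print(strip_latex(r"S\o{}rensen"))
```

The last call shows the cost of the shortcut: commands where the character after the backslash is itself the letter are deleted rather than transliterated.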



In [3]:

#The following processes a string into a list of tokens to pass into gensim phraser. 


stemmer = PorterStemmer()
def get_string_tokens(string):

    string = re.sub(r"(\$.+?\$)","",string, flags = re.DOTALL) #Removes LaTeX math enclosed in $...$ (raw strings avoid invalid-escape warnings)
    string = remove_stopwords(string.lower()) #Removes stopwords and lowers all characters
    string = unidecode.unidecode(string) #Replaces accents, e.g. ö --> o
    string = string.replace("\n"," ") #Replace newlines with just a space
    string = re.sub(r"(\\+.)","",string) #This removes some remaining LaTeX often used for accents, like Carath\'eodory --> Caratheodory.
    string = re.sub(r"([{}]+)","",string) #Some accents are written Carath\'{e}odory, which becomes Carath{e}odory by the previous step, so we remove the braces.
    tokens = simple_preprocess(string, max_len=100) #Removes symbols and returns list of tokens
    tokens = [stemmer.stem(token) for token in tokens] #Stems tokens
    
    return tokens

    
    

def get_metadata(data_file):
    with open(data_file, 'r') as f:
        for line in f:
            yield json.loads(line)
                
def create_arxiv_csv(data_file = 'arxiv-metadata-oai-snapshot.json', pure=pure):
    """Convert the Arxiv JSON dump into the csv files described above.

    Articles are read one at a time through the `get_metadata` generator,
    so the whole JSON file never has to be held in memory.
    """

    get = get_metadata(data_file)

    i=0
    header = True
    mode= "w"
    for article in get:

        i+=1
        if "math.GM" in article["categories"]:
            continue
        else:


            categories = article["categories"].split(" ")

            main_category = categories[0]
            if main_category in pure:
#                 if all([category in pure for category in categories]):
                is_pure = 1
            else:
                is_pure = 0

            
            title = article["title"]
            
            processed_title = " ".join(get_string_tokens(title))
            
            abstract = article["abstract"]
            
            processed_abstract = " ".join(get_string_tokens(abstract))
            


            article = pd.DataFrame({"title":[article["title"]],
                                    "abstract":[article["abstract"]],
                                    "processed_abstract":[processed_abstract],
                                    "processed_title":[processed_title],
                                    "date":[article["versions"][0]["created"]],
                                    "categories": [article["categories"]],
                                    "main_category":[main_category],
                                    "pure": [is_pure]
                                       })

            article[["title","abstract","date","categories","main_category","pure"]].to_csv("arxiv.csv",
                                                                                               header=header,
                                                                                               mode=mode,
                                                                                              index=False)
            
            article[["processed_title","main_category","pure"]].to_csv("processed_titles.csv",
                                                                          header=header,
                                                                          mode=mode,  index=False)
            
            article[["processed_abstract","main_category","pure"]].to_csv("processed_abstracts.csv",
                                                                          header=header,
                                                                          mode=mode, index=False)
            
            math = article[article["main_category"].str.contains("math.", regex=False)]
            
            math[["processed_title","categories","main_category","pure"]].to_csv("processed_math_titles.csv",
                                                                          header=header,
                                                                          mode=mode,  index=False)
            
            math[["processed_abstract","main_category","pure"]].to_csv("processed_math_abstracts.csv",
                                                                          header=header,
                                                                          mode=mode, index=False)
            

            print(f"\r{i} articles processed...",end="")
            
            header=False
            mode = "a"
            

    print("File written successfully")



create_arxiv_csv(data_file = 'arxiv-metadata-oai-snapshot.json', pure=pure)



1789907 articles processed...File written successfully
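Building a one-row DataFrame and appending to five files for every article is likely a noticeable part of the runtime, although this has not been benchmarked here. As a possible alternative, the sketch below streams rows through the standard `csv` module with a single open handle. It writes only a reduced set of columns, and `stream_titles` and the sample records are hypothetical.

```python
import csv
import io
import json

def stream_titles(json_lines, out_handle, pure):
    """Write one CSV row per JSON line without building a DataFrame per article."""
    writer = csv.DictWriter(out_handle, fieldnames=["title", "main_category", "pure"])
    writer.writeheader()
    count = 0
    for line in json_lines:
        article = json.loads(line)
        if "math.GM" in article["categories"]:
            continue  # General Math is omitted, as in the cell above
        main_category = article["categories"].split(" ")[0]
        writer.writerow({"title": article["title"],
                         "main_category": main_category,
                         "pure": int(main_category in pure)})
        count += 1
    return count

# In-memory stand-ins for the JSON dump and the output file (made-up records).
sample_lines = [
    json.dumps({"title": "On primes", "categories": "math.NT math.AG"}),
    json.dumps({"title": "Misc", "categories": "math.GM"}),
    json.dumps({"title": "Deep nets", "categories": "stat.ML"}),
]
sample_out = io.StringIO()
written = stream_titles(sample_lines, sample_out, pure={"math.NT"})
print(written)
```

With a real file, the output handle would be opened once with `open(path, "w", newline="")` and passed in, so the file is not reopened for every article.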


# Bigrams



After that we will learn bigrams from the titles/abstracts. Since these are stored in large files, we pass Gensim's Phrases model an iterable that streams them from disk instead of holding them all in memory. To save time, the iterable can read the already-preprocessed csv files created above rather than redoing the preprocessing. 
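One design point is worth illustrating: a class with an `__iter__` method can be iterated more than once, whereas a bare generator is exhausted after a single pass. That matters whenever a model makes several passes over the corpus, as Word2Vec training does. The sketch below uses only the standard library; `CsvColumnTokens` and the in-memory CSV text are hypothetical stand-ins, not the class used in this notebook.

```python
import csv
import io

class CsvColumnTokens:
    """Re-iterable stream of token lists read from one column of a CSV source."""

    def __init__(self, open_source, column):
        # open_source is a zero-argument callable returning a fresh file-like object,
        # so every call to __iter__ starts again from the first row.
        self.open_source = open_source
        self.column = column

    def __iter__(self):
        with self.open_source() as handle:
            for row in csv.DictReader(handle):
                yield row[self.column].split()

# In-memory stand-in for a file such as processed_abstracts.csv (made-up rows).
CSV_TEXT = "processed_abstract,pure\nmarkov chain mont carlo,0\nrectifi set,1\n"
stream = CsvColumnTokens(lambda: io.StringIO(CSV_TEXT), "processed_abstract")

first_pass = list(stream)
second_pass = list(stream)  # works again; a bare generator would yield nothing here
print(first_pass == second_pass)
```

For a file on disk, `open_source` could be `lambda: open("processed_abstracts.csv", newline="")`.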



In [8]:


#This class iterates through the arxiv file, yielding processed titles/abstracts

#We will also make another file with the processed abstracts so we don't have to do the processing again

class get_text_gen:
    def __init__(self, file_name = "processed_abstracts.csv", column = "processed_abstract", preprocessed=True):
        self.file_name = file_name
        self.column = column
        
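        if preprocessed:
            #Empty cells are read back by pandas as NaN (a float), so anything that is not a string becomes an empty token list
            self.process = lambda x: x.split(" ") if isinstance(x, str) else []
        else:
            self.process = get_string_tokens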
        

    def __iter__(self):
        
        i=0
        
        file = pd.read_csv(self.file_name,chunksize=10000)
        
        for chunk in file:
        
            texts = chunk[self.column]

            for text in texts:
                yield self.process(text)
                i+=1
                print(f"\r{i} articles yielded...",end="")
            
            
                

generator = get_text_gen()


#If passed a file_name and generator, this will learn the bigrams from the strings yielded from the 
#generator and then save the model as file_name; if no generator is given, it will instead load
#a previously saved bigram model from file_name
def get_bigram(file_name, generator=None):

    if generator is None:
        bigram = gensim.models.phrases.Phrases.load(file_name)
        print("Load successful!")
        return bigram
        
    else:
        bigram = gensim.models.phrases.Phrases(generator, min_count=10, threshold=2)

        print("Finished finding Bigrams!")
        
        bigram.save(file_name)
        
        print("Model Saved!")
        return bigram

bigrammer = get_bigram(file_name = "bigram.model",generator=generator)


#For a sanity check, let's see how the bigrammer combined with the processor handles some common strings

string = "Markov Chain Monte Carlo, Fourier Transforms, noncommutative algebra, rectifiable sets"
tokens = get_string_tokens(string)
bigrammer[tokens]

1 articles yielded...2 articles yielded...3 articles yielded...4 articles yielded...5 articles yielded...6 articles yielded...7 articles yielded...8 articles yielded...9 articles yielded...10 articles yielded...11 articles yielded...12 articles yielded...13 articles yielded...14 articles yielded...15 articles yielded...16 articles yielded...17 articles yielded...18 articles yielded...19 articles yielded...20 articles yielded...21 articles yielded...22 articles yielded...23 articles yielded...24 articles yielded...25 articles yielded...26 articles yielded...27 articles yielded...28 articles yielded...29 articles yielded...30 articles yielded...31 articles yielded...32 articles yielded...33 articles yielded...34 articles yielded...35 articles yielded...36 articles yielded...37 articles yielded...38 articles yielded...39 articles yielded...40 articles yielded...41 articles yielded...42 articles yielded...43 articles yielded...44 articles yielded

383 articles yielded...384 articles yielded...385 articles yielded...386 articles yielded...387 articles yielded...388 articles yielded...389 articles yielded...390 articles yielded...391 articles yielded...392 articles yielded...393 articles yielded...394 articles yielded...395 articles yielded...396 articles yielded...397 articles yielded...398 articles yielded...399 articles yielded...400 articles yielded...401 articles yielded...402 articles yielded...403 articles yielded...404 articles yielded...405 articles yielded...406 articles yielded...407 articles yielded...408 articles yielded...409 articles yielded...410 articles yielded...411 articles yielded...412 articles yielded...413 articles yielded...414 articles yielded...415 articles yielded...416 articles yielded...417 articles yielded...418 articles yielded...419 articles yielded...420 articles yielded...421 articles yielded...422 articles yielded...423 articles yielded...424 articles yi

804 articles yielded...805 articles yielded...806 articles yielded...807 articles yielded...808 articles yielded...809 articles yielded...810 articles yielded...811 articles yielded...812 articles yielded...813 articles yielded...814 articles yielded...815 articles yielded...816 articles yielded...817 articles yielded...818 articles yielded...819 articles yielded...820 articles yielded...821 articles yielded...822 articles yielded...823 articles yielded...824 articles yielded...825 articles yielded...826 articles yielded...827 articles yielded...828 articles yielded...829 articles yielded...830 articles yielded...831 articles yielded...832 articles yielded...833 articles yielded...834 articles yielded...835 articles yielded...836 articles yielded...837 articles yielded...838 articles yielded...839 articles yielded...840 articles yielded...841 articles yielded...842 articles yielded...843 articles yielded...844 articles yielded...845 articles yi

1252 articles yielded...1253 articles yielded...1254 articles yielded...1255 articles yielded...1256 articles yielded...1257 articles yielded...1258 articles yielded...1259 articles yielded...1260 articles yielded...1261 articles yielded...1262 articles yielded...1263 articles yielded...1264 articles yielded...1265 articles yielded...1266 articles yielded...1267 articles yielded...1268 articles yielded...1269 articles yielded...1270 articles yielded...1271 articles yielded...1272 articles yielded...1273 articles yielded...1274 articles yielded...1275 articles yielded...1276 articles yielded...1277 articles yielded...1278 articles yielded...1279 articles yielded...1280 articles yielded...1281 articles yielded...1282 articles yielded...1283 articles yielded...1284 articles yielded...1285 articles yielded...1286 articles yielded...1287 articles yielded...1288 articles yielded...1289 articles yielded...1290 articles yielded...1291 articles yielded...

1769 articles yielded...1770 articles yielded...1771 articles yielded...1772 articles yielded...1773 articles yielded...1774 articles yielded...1775 articles yielded...1776 articles yielded...1777 articles yielded...1778 articles yielded...1779 articles yielded...1780 articles yielded...1781 articles yielded...1782 articles yielded...1783 articles yielded...1784 articles yielded...1785 articles yielded...1786 articles yielded...1787 articles yielded...1788 articles yielded...1789 articles yielded...1790 articles yielded...1791 articles yielded...1792 articles yielded...1793 articles yielded...1794 articles yielded...1795 articles yielded...1796 articles yielded...1797 articles yielded...1798 articles yielded...1799 articles yielded...1800 articles yielded...1801 articles yielded...1802 articles yielded...1803 articles yielded...1804 articles yielded...1805 articles yielded...1806 articles yielded...1807 articles yielded...1808 articles yielded...

2331 articles yielded...2332 articles yielded...2333 articles yielded...2334 articles yielded...2335 articles yielded...2336 articles yielded...2337 articles yielded...2338 articles yielded...2339 articles yielded...2340 articles yielded...2341 articles yielded...2342 articles yielded...2343 articles yielded...2344 articles yielded...2345 articles yielded...2346 articles yielded...2347 articles yielded...2348 articles yielded...2349 articles yielded...2350 articles yielded...2351 articles yielded...2352 articles yielded...2353 articles yielded...2354 articles yielded...2355 articles yielded...2356 articles yielded...2357 articles yielded...2358 articles yielded...2359 articles yielded...2360 articles yielded...2361 articles yielded...2362 articles yielded...2363 articles yielded...2364 articles yielded...2365 articles yielded...2366 articles yielded...2367 articles yielded...2368 articles yielded...2369 articles yielded...2370 articles yielded...

2739 articles yielded...2740 articles yielded...2741 articles yielded...2742 articles yielded...2743 articles yielded...2744 articles yielded...2745 articles yielded...2746 articles yielded...2747 articles yielded...2748 articles yielded...2749 articles yielded...2750 articles yielded...2751 articles yielded...2752 articles yielded...2753 articles yielded...2754 articles yielded...2755 articles yielded...2756 articles yielded...2757 articles yielded...2758 articles yielded...2759 articles yielded...2760 articles yielded...2761 articles yielded...2762 articles yielded...2763 articles yielded...2764 articles yielded...2765 articles yielded...2766 articles yielded...2767 articles yielded...2768 articles yielded...2769 articles yielded...2770 articles yielded...2771 articles yielded...2772 articles yielded...2773 articles yielded...2774 articles yielded...2775 articles yielded...2776 articles yielded...2777 articles yielded...2778 articles yielded...

3294 articles yielded...3295 articles yielded...3296 articles yielded...3297 articles yielded...3298 articles yielded...3299 articles yielded...3300 articles yielded...3301 articles yielded...3302 articles yielded...3303 articles yielded...3304 articles yielded...3305 articles yielded...3306 articles yielded...3307 articles yielded...3308 articles yielded...3309 articles yielded...3310 articles yielded...3311 articles yielded...3312 articles yielded...3313 articles yielded...3314 articles yielded...3315 articles yielded...3316 articles yielded...3317 articles yielded...3318 articles yielded...3319 articles yielded...3320 articles yielded...3321 articles yielded...3322 articles yielded...3323 articles yielded...3324 articles yielded...3325 articles yielded...3326 articles yielded...3327 articles yielded...3328 articles yielded...3329 articles yielded...3330 articles yielded...3331 articles yielded...3332 articles yielded...3333 articles yielded...

3822 articles yielded...3823 articles yielded...3824 articles yielded...3825 articles yielded...3826 articles yielded...3827 articles yielded...3828 articles yielded...3829 articles yielded...3830 articles yielded...3831 articles yielded...3832 articles yielded...3833 articles yielded...3834 articles yielded...3835 articles yielded...3836 articles yielded...3837 articles yielded...3838 articles yielded...3839 articles yielded...3840 articles yielded...3841 articles yielded...3842 articles yielded...3843 articles yielded...3844 articles yielded...3845 articles yielded...3846 articles yielded...3847 articles yielded...3848 articles yielded...3849 articles yielded...3850 articles yielded...3851 articles yielded...3852 articles yielded...3853 articles yielded...3854 articles yielded...3855 articles yielded...3856 articles yielded...3857 articles yielded...3858 articles yielded...3859 articles yielded...3860 articles yielded...3861 articles yielded...

4292 articles yielded...4293 articles yielded...4294 articles yielded...4295 articles yielded...4296 articles yielded...4297 articles yielded...4298 articles yielded...4299 articles yielded...4300 articles yielded...4301 articles yielded...4302 articles yielded...4303 articles yielded...4304 articles yielded...4305 articles yielded...4306 articles yielded...4307 articles yielded...4308 articles yielded...4309 articles yielded...4310 articles yielded...4311 articles yielded...4312 articles yielded...4313 articles yielded...4314 articles yielded...4315 articles yielded...4316 articles yielded...4317 articles yielded...4318 articles yielded...4319 articles yielded...4320 articles yielded...4321 articles yielded...4322 articles yielded...4323 articles yielded...4324 articles yielded...4325 articles yielded...4326 articles yielded...4327 articles yielded...4328 articles yielded...4329 articles yielded...4330 articles yielded...4331 articles yielded...

4742 articles yielded...4743 articles yielded...4744 articles yielded...4745 articles yielded...4746 articles yielded...4747 articles yielded...4748 articles yielded...4749 articles yielded...4750 articles yielded...4751 articles yielded...4752 articles yielded...4753 articles yielded...4754 articles yielded...4755 articles yielded...4756 articles yielded...4757 articles yielded...4758 articles yielded...4759 articles yielded...4760 articles yielded...4761 articles yielded...4762 articles yielded...4763 articles yielded...4764 articles yielded...4765 articles yielded...4766 articles yielded...4767 articles yielded...4768 articles yielded...4769 articles yielded...4770 articles yielded...4771 articles yielded...4772 articles yielded...4773 articles yielded...4774 articles yielded...4775 articles yielded...4776 articles yielded...4777 articles yielded...4778 articles yielded...4779 articles yielded...4780 articles yielded...4781 articles yielded...

5333 articles yielded...5334 articles yielded...5335 articles yielded...5336 articles yielded...5337 articles yielded...5338 articles yielded...5339 articles yielded...5340 articles yielded...5341 articles yielded...5342 articles yielded...5343 articles yielded...5344 articles yielded...5345 articles yielded...5346 articles yielded...5347 articles yielded...5348 articles yielded...5349 articles yielded...5350 articles yielded...5351 articles yielded...5352 articles yielded...5353 articles yielded...5354 articles yielded...5355 articles yielded...5356 articles yielded...5357 articles yielded...5358 articles yielded...5359 articles yielded...5360 articles yielded...5361 articles yielded...5362 articles yielded...5363 articles yielded...5364 articles yielded...5365 articles yielded...5366 articles yielded...5367 articles yielded...5368 articles yielded...5369 articles yielded...5370 articles yielded...5371 articles yielded...5372 articles yielded...

5795 articles yielded...5796 articles yielded...5797 articles yielded...5798 articles yielded...5799 articles yielded...5800 articles yielded...5801 articles yielded...5802 articles yielded...5803 articles yielded...5804 articles yielded...5805 articles yielded...5806 articles yielded...5807 articles yielded...5808 articles yielded...5809 articles yielded...5810 articles yielded...5811 articles yielded...5812 articles yielded...5813 articles yielded...5814 articles yielded...5815 articles yielded...5816 articles yielded...5817 articles yielded...5818 articles yielded...5819 articles yielded...5820 articles yielded...5821 articles yielded...5822 articles yielded...5823 articles yielded...5824 articles yielded...5825 articles yielded...5826 articles yielded...5827 articles yielded...5828 articles yielded...5829 articles yielded...5830 articles yielded...5831 articles yielded...5832 articles yielded...5833 articles yielded...5834 articles yielded...

6137 articles yielded...6138 articles yielded...6139 articles yielded...6140 articles yielded...6141 articles yielded...6142 articles yielded...6143 articles yielded...6144 articles yielded...6145 articles yielded...6146 articles yielded...6147 articles yielded...6148 articles yielded...6149 articles yielded...6150 articles yielded...6151 articles yielded...6152 articles yielded...6153 articles yielded...6154 articles yielded...6155 articles yielded...6156 articles yielded...6157 articles yielded...6158 articles yielded...6159 articles yielded...6160 articles yielded...6161 articles yielded...6162 articles yielded...6163 articles yielded...6164 articles yielded...6165 articles yielded...6166 articles yielded...6167 articles yielded...6168 articles yielded...6169 articles yielded...6170 articles yielded...6171 articles yielded...6172 articles yielded...6173 articles yielded...6174 articles yielded...6175 articles yielded...6176 articles yielded...

6620 articles yielded...6621 articles yielded...6622 articles yielded...6623 articles yielded...6624 articles yielded...6625 articles yielded...6626 articles yielded...6627 articles yielded...6628 articles yielded...6629 articles yielded...6630 articles yielded...6631 articles yielded...6632 articles yielded...6633 articles yielded...6634 articles yielded...6635 articles yielded...6636 articles yielded...6637 articles yielded...6638 articles yielded...6639 articles yielded...6640 articles yielded...6641 articles yielded...6642 articles yielded...6643 articles yielded...6644 articles yielded...6645 articles yielded...6646 articles yielded...6647 articles yielded...6648 articles yielded...6649 articles yielded...6650 articles yielded...6651 articles yielded...6652 articles yielded...6653 articles yielded...6654 articles yielded...6655 articles yielded...6656 articles yielded...6657 articles yielded...6658 articles yielded...6659 articles yielded...

7050 articles yielded...7051 articles yielded...7052 articles yielded...7053 articles yielded...7054 articles yielded...7055 articles yielded...7056 articles yielded...7057 articles yielded...7058 articles yielded...7059 articles yielded...7060 articles yielded...7061 articles yielded...7062 articles yielded...7063 articles yielded...7064 articles yielded...7065 articles yielded...7066 articles yielded...7067 articles yielded...7068 articles yielded...7069 articles yielded...7070 articles yielded...7071 articles yielded...7072 articles yielded...7073 articles yielded...7074 articles yielded...7075 articles yielded...7076 articles yielded...7077 articles yielded...7078 articles yielded...7079 articles yielded...7080 articles yielded...7081 articles yielded...7082 articles yielded...7083 articles yielded...7084 articles yielded...7085 articles yielded...7086 articles yielded...7087 articles yielded...7088 articles yielded...7089 articles yielded...

7523 articles yielded...7524 articles yielded...7525 articles yielded...7526 articles yielded...7527 articles yielded...7528 articles yielded...7529 articles yielded...7530 articles yielded...7531 articles yielded...7532 articles yielded...7533 articles yielded...7534 articles yielded...7535 articles yielded...7536 articles yielded...7537 articles yielded...7538 articles yielded...7539 articles yielded...7540 articles yielded...7541 articles yielded...7542 articles yielded...7543 articles yielded...7544 articles yielded...7545 articles yielded...7546 articles yielded...7547 articles yielded...7548 articles yielded...7549 articles yielded...7550 articles yielded...7551 articles yielded...7552 articles yielded...7553 articles yielded...7554 articles yielded...7555 articles yielded...7556 articles yielded...7557 articles yielded...7558 articles yielded...7559 articles yielded...7560 articles yielded...7561 articles yielded...7562 articles yielded...

7942 articles yielded...7943 articles yielded...7944 articles yielded...7945 articles yielded...7946 articles yielded...7947 articles yielded...7948 articles yielded...7949 articles yielded...7950 articles yielded...7951 articles yielded...7952 articles yielded...7953 articles yielded...7954 articles yielded...7955 articles yielded...7956 articles yielded...7957 articles yielded...7958 articles yielded...7959 articles yielded...7960 articles yielded...7961 articles yielded...7962 articles yielded...7963 articles yielded...7964 articles yielded...7965 articles yielded...7966 articles yielded...7967 articles yielded...7968 articles yielded...7969 articles yielded...7970 articles yielded...7971 articles yielded...7972 articles yielded...7973 articles yielded...7974 articles yielded...7975 articles yielded...7976 articles yielded...7977 articles yielded...7978 articles yielded...7979 articles yielded...7980 articles yielded...7981 articles yielded...

8339 articles yielded...8340 articles yielded...8341 articles yielded...8342 articles yielded...8343 articles yielded...8344 articles yielded...8345 articles yielded...8346 articles yielded...8347 articles yielded...8348 articles yielded...8349 articles yielded...8350 articles yielded...8351 articles yielded...8352 articles yielded...8353 articles yielded...8354 articles yielded...8355 articles yielded...8356 articles yielded...8357 articles yielded...8358 articles yielded...8359 articles yielded...8360 articles yielded...8361 articles yielded...8362 articles yielded...8363 articles yielded...8364 articles yielded...8365 articles yielded...8366 articles yielded...8367 articles yielded...8368 articles yielded...8369 articles yielded...8370 articles yielded...8371 articles yielded...8372 articles yielded...8373 articles yielded...8374 articles yielded...8375 articles yielded...8376 articles yielded...8377 articles yielded...8378 articles yielded...

8804 articles yielded...8805 articles yielded...8806 articles yielded...8807 articles yielded...8808 articles yielded...8809 articles yielded...8810 articles yielded...8811 articles yielded...8812 articles yielded...8813 articles yielded...8814 articles yielded...8815 articles yielded...8816 articles yielded...8817 articles yielded...8818 articles yielded...8819 articles yielded...8820 articles yielded...8821 articles yielded...8822 articles yielded...8823 articles yielded...8824 articles yielded...8825 articles yielded...8826 articles yielded...8827 articles yielded...8828 articles yielded...8829 articles yielded...8830 articles yielded...8831 articles yielded...8832 articles yielded...8833 articles yielded...8834 articles yielded...8835 articles yielded...8836 articles yielded...8837 articles yielded...8838 articles yielded...8839 articles yielded...8840 articles yielded...8841 articles yielded...8842 articles yielded...8843 articles yielded...

9311 articles yielded...9312 articles yielded...9313 articles yielded...9314 articles yielded...9315 articles yielded...9316 articles yielded...9317 articles yielded...9318 articles yielded...9319 articles yielded...9320 articles yielded...9321 articles yielded...9322 articles yielded...9323 articles yielded...9324 articles yielded...9325 articles yielded...9326 articles yielded...9327 articles yielded...9328 articles yielded...9329 articles yielded...9330 articles yielded...9331 articles yielded...9332 articles yielded...9333 articles yielded...9334 articles yielded...9335 articles yielded...9336 articles yielded...9337 articles yielded...9338 articles yielded...9339 articles yielded...9340 articles yielded...9341 articles yielded...9342 articles yielded...9343 articles yielded...9344 articles yielded...9345 articles yielded...9346 articles yielded...9347 articles yielded...9348 articles yielded...9349 articles yielded...9350 articles yielded...

1786970 articles yielded...Finished finding Bigrams!
Model Saved!


['markov_chain',
 'mont_carlo',
 'fourier_transform',
 'noncommut_algebra',
 'rectifi',
 'set']

# Word2Vec Embedding

In this section we learn the Word2Vec embeddings, again using the model from Gensim. First, we make a generator that will feed the processed abstracts to the model.

We only use the abstracts from Arxiv to learn the vectors. We don't use the Mathscinet titles since a) there is less text and b) the Arxiv abstracts contain terminology from many fields other than mathematics, which is important if we want to identify whether a title is applied. 

The dimension of the Word2Vec embeddings is set to 256. We tried higher values like 512, but PCA showed that the explained variance began to plateau around 200, so it didn't seem like we would gain much from a larger ambient space.
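
The PCA check mentioned above can be sketched as follows. This is only an illustration on synthetic data, not the original experiment: the helper `cumulative_explained_variance`, the toy matrix, and the 95% threshold are all assumptions introduced here.

```python
import numpy as np

def cumulative_explained_variance(vectors):
    """Cumulative fraction of variance explained by the leading principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Squared singular values of the centered data are proportional to the PCA variances.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return np.cumsum(variances) / variances.sum()

# Synthetic stand-in for an embedding matrix (NOT the real vectors): 1000 "words"
# in 64 dimensions, with almost all variance in the first 10 directions.
rng = np.random.default_rng(0)
scales = np.concatenate([np.full(10, 5.0), np.full(54, 0.1)])
toy_vectors = rng.normal(size=(1000, 64)) * scales

curve = cumulative_explained_variance(toy_vectors)
# Smallest number of components whose cumulative share reaches 95%.
n_components_95 = int(np.searchsorted(curve, 0.95) + 1)
print(n_components_95)
```

With a trained Gensim model, the matrix to pass in would presumably be `wv_model.wv.vectors`; plotting `curve` against the component index is one way to see where it flattens out.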



In [9]:
class get_bigrammed_texts:
    """Re-iterable stream of bigrammed token lists, read from a CSV file in chunks."""

    def __init__(self, file_name="processed_abstracts.csv",
                 column="processed_abstract",
                 bigrammer=bigrammer,
                 preprocessed=True):

        self.file_name = file_name
        self.column = column
        self.epoch = 0

        if preprocessed:
            # Text is already preprocessed: split on spaces, then merge bigrams.
            self.process = lambda x: bigrammer[x.split(" ")]
        else:
            # Raw text: tokenize first, then merge bigrams.
            self.process = lambda x: bigrammer[get_string_tokens(x)]

    def __iter__(self):
        # Each call starts a fresh pass over the file, so the model can iterate repeatedly.
        self.epoch += 1
        i = 0

        for chunk in pd.read_csv(self.file_name, chunksize=10000):
            for text in chunk[self.column]:
                yield self.process(text)
                print(f"\r{i} articles of epoch {self.epoch}...", end="")
                i += 1


bigram_generator = get_bigrammed_texts()
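
The reason for wrapping the reader in a class with `__iter__`, rather than writing a plain generator function, is that Gensim's Word2Vec needs to iterate over the corpus several times: one pass to build the vocabulary and one pass per training epoch. A minimal sketch of the difference, using made-up names (`one_shot_stream`, `RestartableStream`) and toy texts that are not part of this notebook:

```python
def one_shot_stream(texts):
    """Generator function: the generator it returns can be consumed only once."""
    for text in texts:
        yield text.split(" ")

class RestartableStream:
    """Iterable class: every call to __iter__ starts a fresh pass over the data."""
    def __init__(self, texts):
        self.texts = texts
        self.passes = 0

    def __iter__(self):
        self.passes += 1
        for text in self.texts:
            yield text.split(" ")

# Toy corpus, for illustration only.
toy_texts = ["markov chain mont carlo", "fourier transform"]

generator = one_shot_stream(toy_texts)
first_pass = list(generator)
second_pass = list(generator)   # empty: the generator is already exhausted

stream = RestartableStream(toy_texts)
passes = [list(stream) for _ in range(3)]   # three full passes over the same data

print(len(first_pass), len(second_pass))          # 2 0
print(stream.passes, [len(p) for p in passes])    # 3 [2, 2, 2]
```

The pass counter plays the same role as `self.epoch` in the class above.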



Now we feed this generator into the Word2Vec model.

In [10]:



# As before: if a bigram_generator is given, learn the word2vec representations and save the model
# as file_name; otherwise load a previously saved model from file_name.
# Other kwargs (number of epochs, window size, etc.) are passed through to Gensim's Word2Vec model.

def get_word2vec(file_name, bigram_generator=None, **kwargs):

    if bigram_generator is None:
        model = gensim.models.Word2Vec.load(file_name)
        print("File loaded!")
        return model

    print("Learning vectors...")
    model = gensim.models.word2vec.Word2Vec(sentences=bigram_generator, **kwargs)
    print("Learned vectors!")

    model.save(file_name)
    print("Model saved as " + file_name)

    return model

# window=10: context window of 10 tokens; sg=1: skip-gram rather than CBOW.
# Note: `size` is the Gensim 3.x name for the embedding dimension; Gensim 4.0+ calls it `vector_size`.
wv_model = get_word2vec(file_name="word2vec.model", window=10, sg=1, size=256, bigram_generator=bigram_generator)


Learning vectors...
1786969 articles of epoch 6...Learned vectors!
Model saved as word2vec.model


Again, as a sanity check, let's look at the most similar vectors to some common terms. Note that several misspellings of "schrodinger" remain, created when accents were turned into spaces; the bigrammer still joins the broken pieces into bigrams (e.g. `schr_oding`), and Word2Vec still recognizes them as similar words. 

In [13]:
def process_word(string):
    return " ".join(bigrammer[get_string_tokens(string)])

# Query through the model's KeyedVectors (`wv_model.wv`), which works in both Gensim 3.x and 4.x.
for word in ["schrodinger", "monte carlo", "metric space", "hausdorff dimension"]:
    print(wv_model.wv.most_similar(process_word(word)))


[('schroeding', 0.8230994939804077), ('schroding_equat', 0.7191005945205688), ('schr_oding', 0.7140465974807739), ('schroeding_equat', 0.692336916923523), ('schrdinger', 0.6703927516937256), ('schoeding', 0.6639249324798584), ('schoding', 0.6581615209579468), ('stargenvalu', 0.6550089120864868), ('shroding', 0.6546859741210938), ('doebner_goldin', 0.6540631055831909)]
[('mc_simul', 0.733250081539154), ('simul', 0.7229532599449158), ('rathsman', 0.6707162857055664), ('montecarlo', 0.6619164943695068), ('carlo_mc', 0.6618112325668335), ('ekhara', 0.658603847026825), ('herwig_mont', 0.6425133943557739), ('gener_phokhara', 0.6417258977890015), ('qcdin', 0.6397531032562256), ('multimagnet', 0.6395081281661987)]
[('urysohn_univers', 0.7157913446426392), ('yoneda_complet', 0.6919934749603271), ('equip_wasserstein', 0.6906655430793762), ('pseudometr_space', 0.6848750114440918), ('urysohn_metric', 0.6839586496353149), ('ultraextens', 0.6832410097122192), ('isometr_emb', 0.6811697483062744), ('n
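
The scores in the output above are cosine similarities between word vectors. As a rough illustration of what such a ranking computes, here is a pure-Python sketch; the helper names and the 3-dimensional vectors are made up for this example and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, topn=2):
    """Rank every other word by cosine similarity to `query`, highest first."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors, purely for illustration.
toy_vectors = {
    "schroding": [0.9, 0.1, 0.0],
    "schroeding": [0.8, 0.2, 0.1],
    "mont_carlo": [0.0, 0.9, 0.4],
    "simul": [0.1, 0.8, 0.5],
}

top = most_similar("schroding", toy_vectors, topn=1)
print(top[0][0])   # schroeding
```

Spelling variants that appear in the same contexts end up with nearly parallel vectors, which is why the misspellings rank so highly.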

# The Mathscinet data

I won't include the scraped Mathscinet data here, but the preprocessing of the titles is exactly the same. However, we should mention how we did the labeling. 

Mathscinet classifies papers according to the Mathematics Subject Classification (MSC). Below we list the main categories (minus the general mathematics category) and partition them into pure and applied.

In [1]:
applied = [
       
       'probability theory and stochastic processes', 'statistics',
       'numerical analysis', 'computer science',
       'mechanics of particles and systems',
       'mechanics of deformable solids', 'fluid mechanics',
       'optics, electromagnetic theory',
       'classical thermodynamics, heat transfer', 
       'statistical mechanics, structure of matter',
       'astronomy and astrophysics', 'geophysics',
       'operations research, mathematical programming',
       'game theory, economics, social and behavioral sciences',
       'biology and other natural sciences', 'systems theory; control',
       'information and communication, circuits', 'mathematics education',      
        'calculus of variations and optimal control; optimization']

pure = ['difference and functional equations','history and biography', 'dynamical systems and ergodic theory', 
        'mathematical logic and foundations',
       'combinatorics', 'order, lattices, ordered algebraic structures',
       'general algebraic systems', 'number theory',
       'field theory and polynomials', 'commutative rings and algebras',
       'algebraic geometry',
       'linear and multilinear algebra; matrix theory',
       'associative rings and algebras',
       'nonassociative rings and algebras',
       'category theory; homological algebra', '$k$-theory',
       'group theory and generalizations',
       'topological groups, lie groups', 'real functions',
       'measure and integration', 'functions of a complex variable',
       'potential theory',
       'several complex variables and analytic spaces',
       'special functions (33-xx deals with the properties of functions as functions)',
       'ordinary differential equations',
       'partial differential equations','relativity and gravitational theory', 'quantum theory', 'geometry', 'convex and discrete geometry',
       'differential geometry', 'general topology', 'algebraic topology',
       'manifolds and cell complexes',
       'global analysis, analysis on manifolds', 'sequences, series, summability', 'approximations and expansions',
       'fourier analysis', 'abstract harmonic analysis',
       'integral transforms, operational calculus', 'integral equations',
       'functional analysis', 'operator theory']





Again, there is no clear-cut division between pure and applied. For example, "abstract harmonic analysis" is unambiguously pure, whereas "calculus of variations and optimal control" is a mix, with very pure and very applied results in the same category; we play it safe and classify it as applied. 
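
One way to turn such a partition into labels is a simple lookup table. The sketch below is an assumption about how this could be done, not the code actually used: the helper names, the 0/1 encoding, and the short excerpt lists are all introduced here for illustration.

```python
# Hypothetical excerpts of the `applied` and `pure` lists above (not the full partition).
applied_excerpt = ["statistics", "numerical analysis", "fluid mechanics"]
pure_excerpt = ["number theory", "algebraic geometry", "general topology"]

def build_label_map(pure_categories, applied_categories):
    """Map each category name to 0 (pure) or 1 (applied); reject overlapping lists."""
    overlap = set(pure_categories) & set(applied_categories)
    if overlap:
        raise ValueError(f"categories labeled both pure and applied: {sorted(overlap)}")
    label_map = {name: 0 for name in pure_categories}
    label_map.update({name: 1 for name in applied_categories})
    return label_map

def label_category(category, label_map):
    """Return the label for a category, or None if it is not in the partition."""
    return label_map.get(category.strip().lower())

label_map = build_label_map(pure_excerpt, applied_excerpt)
print(label_category("Number Theory", label_map))   # 0
print(label_category("statistics", label_map))      # 1
print(label_category("general", label_map))         # None
```

Checking for overlap guards against a category accidentally appearing in both lists, and returning `None` makes unlabeled categories (such as the excluded general mathematics category) easy to drop.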

The last preprocessing step for the Mathscinet database was to filter out non-English titles using the langdetect package. 
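
A filter of this kind might look like the sketch below. It is not the original code: `keep_english` and `toy_detect` are hypothetical names, and the detector is passed in as an argument so that the example runs without extra packages. With langdetect installed, one could pass `langdetect.detect` instead, which returns language codes such as "en" and raises an exception on text with no detectable features.

```python
def keep_english(titles, detect):
    """Keep titles whose detected language code is 'en'; skip titles the detector fails on."""
    kept = []
    for title in titles:
        try:
            if detect(title) == "en":
                kept.append(title)
        except Exception:
            # Detection failed (for example, empty text): drop the title.
            continue
    return kept

# Stand-in detector for illustration only; it is NOT a real language detector.
def toy_detect(text):
    if not text.strip():
        raise ValueError("no text to analyse")
    french_markers = {"sur", "les", "des", "une"}
    return "fr" if french_markers & set(text.lower().split()) else "en"

titles = [
    "On the Hausdorff dimension of Julia sets",
    "Sur les groupes de Lie",
    "   ",
]
english_titles = keep_english(titles, toy_detect)
print(english_titles)   # ['On the Hausdorff dimension of Julia sets']
```

langdetect is non-deterministic by default, so setting `DetectorFactory.seed` to a fixed value before detecting is worth considering if the filtering needs to be reproducible.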