# Recommender system

#### Once our database is in vectorised form, we can also suggest articles that are similar to a given input article. This can be useful for authors, as it might point out articles related to their work that they were not aware of. As a first attempt at this problem we can use cosine similarity.

#### The code in this notebook is essentially the same as the code in the module model.py.

In [1]:
import pickle
import os
import re
import numpy as np
from random import shuffle
import shutil
from urllib.request import urlopen


from utils import Config, read_clean

from sklearn.feature_extraction.text import TfidfVectorizer


In [2]:
with open(Config.tfidf,'rb') as file:
    tfidf=pickle.load(file)
with open(Config.vectorized_articles,'rb') as file:
    vectorized_articles=pickle.load(file)

In [3]:
X=vectorized_articles['X']

In [4]:
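#Function that returns the articles in the database X that are closest to the given article x, based on cosine similarity
#The dot product equals the cosine similarity only if the rows of X and x have unit L2 norm (the default for TfidfVectorizer)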
def find_similar(X,x,how_many):
    cos_similarity=x.dot(X.transpose())
    cos_similarity=np.asarray(cos_similarity.todense())
    return np.argsort(-cos_similarity)[:,:how_many]
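#### The dot product used in `find_similar` is a cosine similarity only when every row has unit L2 norm, which is the default for `TfidfVectorizer` (`norm='l2'`). The following is a minimal sketch on a small dense toy matrix that normalises the rows explicitly; it is not the pipeline's own code, and the name `top_k_cosine` and the toy values are assumptions made for illustration.

```python
import numpy as np

def top_k_cosine(X, x, how_many):
    """Return, for each row of x, the indices of the how_many most similar rows of X."""
    # Normalise the rows explicitly so that the dot product is a true cosine similarity.
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    cos_similarity = x_unit @ X_unit.T  # shape: (n_queries, n_articles)
    # Sort by decreasing similarity and keep the first how_many columns.
    return np.argsort(-cos_similarity, axis=1)[:, :how_many]

# Toy "database" of four articles described by three features (hypothetical values).
X_toy = np.array([[1.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.2, 1.0]])

neighbours = top_k_cosine(X_toy, X_toy[0:1, :], 2)
print(neighbours)
```

#### In this toy case the query is itself a row of the database, so it is returned first (its similarity with itself is 1), followed by the row that points in almost the same direction.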

#### We can see an example of how this works on the first two articles of the database.

In [5]:
print(vectorized_articles['articles'][find_similar(X,X[0:2,:],10)])
print(vectorized_articles['links'][find_similar(X,X[0:2,:],10)])

[['2003.01771' '1902.08464' '1807.09515' '1901.07357' '1910.00583'
  '0910.5001' '1904.01053' '2002.06177' '1711.00678' '1911.01756']
 ['1908.02651' '2001.01266' '2002.08316' '1908.02640' '1907.00271'
  '1902.06655' '1906.00478' '1911.03282' '1911.08356' '1906.09380']]
[['http://arxiv.org/pdf/2003.01771v1' 'http://arxiv.org/pdf/1902.08464v3'
  'http://arxiv.org/pdf/1807.09515v1' 'http://arxiv.org/pdf/1901.07357v1'
  'http://arxiv.org/pdf/1910.00583v2' 'http://arxiv.org/pdf/0910.5001v1'
  'http://arxiv.org/pdf/1904.01053v1' 'http://arxiv.org/pdf/2002.06177v3'
  'http://arxiv.org/pdf/1711.00678v1' 'http://arxiv.org/pdf/1911.01756v2']
 ['http://arxiv.org/pdf/1908.02651v3' 'http://arxiv.org/pdf/2001.01266v2'
  'http://arxiv.org/pdf/2002.08316v1' 'http://arxiv.org/pdf/1908.02640v1'
  'http://arxiv.org/pdf/1907.00271v1' 'http://arxiv.org/pdf/1902.06655v1'
  'http://arxiv.org/pdf/1906.00478v3' 'http://arxiv.org/pdf/1911.03282v2'
  'http://arxiv.org/pdf/1911.08356v1' 'http://arxiv.org/pdf/1906

## Evaluation of the recommender system

#### Looking at the articles returned by the recommender system, they appear to be very similar to the one given as input. However, we would like a more quantitative measure of how good the recommender system is.
#### In general this is not an easy problem: we do not know whether the recommended articles are relevant, since finding out would require reading them, which is clearly not feasible for a large number of articles. One possible approach is to consider as relevant all the articles cited by the paper under consideration. Since not all the articles cited by articles in the database are necessarily part of the database, we will need to download the cited articles.
#### Unfortunately, cited articles are often not freely accessible, which makes them difficult to download. For this reason we will only consider citations to arXiv articles, which we can easily download. This approach therefore has evident limitations, which could be overcome, or at least reduced, with more accessible resources.
#### Another point to consider is that not all cited articles are necessarily relevant. This type of complication could possibly be resolved through some data labelling, which we will not perform here.
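#### A minimal sketch of how citations could be turned into a score, assuming that the cited arXiv articles are the only relevant ones: for each article we measure what fraction of its citations appears among the first k recommendations (recall at k). The function name `recall_at_k` and the example identifiers are hypothetical, and this is not necessarily the metric used in model.py.

```python
def recall_at_k(recommended, cited, k):
    """Fraction of the cited articles found among the first k recommendations."""
    cited = set(cited)
    if not cited:
        return 0.0  # nothing to retrieve: the score is defined as 0 by convention
    hits = cited.intersection(recommended[:k])
    return len(hits) / len(cited)

# Hypothetical identifiers, for illustration only.
recommended = ['A', 'B', 'C', 'D', 'E']
cited = ['B', 'E', 'Z']

score = recall_at_k(recommended, cited, 3)
print(score)
```

#### Averaging this score over all the articles that have at least one arXiv citation would give a single number for the whole system. Because uncited articles can still be relevant, such a score is better read as a lower bound than as an exact measure of quality.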

#### Let's therefore start by selecting a set of articles on which we will perform the analysis.

In [6]:
with open(Config.metadata_db,'rb') as file:
    metadata_db=pickle.load(file)

txt_labels_train=[]

for article in metadata_db.keys():
    article_path=os.path.join(Config.txt_db,article+'.txt')
    if os.path.isfile(article_path):
        txt_labels_train.append(article)    

In [7]:
txt_labels_shuffle=list(txt_labels_train) #Hard copy for the set of articles
shuffle(txt_labels_shuffle)
#We use the read_clean function as it will remove the citation to the paper itself
max_articles=7000
idxs=txt_labels_shuffle[:max_articles]

In [8]:
citations_db={}
for idx in idxs:
    with open(os.path.join(Config.txt_db,idx+'.txt'),'r') as file:
        article=file.read()
    citations=re.findall('arXiv:[0-9]+.[0-9]+|arXiv:[a-zA-Z-]+.[0-9]+|arXiv:[a-zA-Z-]+.[0-9]+.[0-9]',article)
    if len(citations)>1:
        citations_db[idx]=citations[1:] # the first citation is generally the paper itself
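#### In the pattern above the dots are not escaped, so `.` matches any character and strings such as `arXiv:180506084` are captured as well. A stricter variant is sketched below, assuming the two identifier schemes used by arXiv (new style `YYMM.NNNN` or `YYMM.NNNNN`, old style `archive/YYMMNNN`). The names `ARXIV_ID` and `find_arxiv_ids` are illustrative, and the pattern has not been run on the database.

```python
import re

# New-style ids: 4 digits, a literal dot, 4 or 5 digits, optional version.
# Old-style ids: archive name (optionally with a subject class), a slash, 7 digits.
ARXIV_ID = re.compile(
    r'arXiv:('
    r'\d{4}\.\d{4,5}(?:v\d+)?'
    r'|[a-z]+(?:-[a-z]+)?(?:\.[A-Z]{2})?/\d{7}(?:v\d+)?'
    r')(?!\d)'
)

def find_arxiv_ids(text):
    """Return the arXiv identifiers cited in text, without the 'arXiv:' prefix."""
    return ARXIV_ID.findall(text)

sample = ('See arXiv:1412.6980 and arXiv:hep-th/9901001v2, '
          'but not arXiv:180506084 or arXiv:1907.')
ids = find_arxiv_ids(sample)
print(ids)
```

#### A stricter pattern trades recall for precision: citations whose punctuation was lost when the pdf was converted to text are skipped instead of producing download URLs that cannot exist.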

#### We can now see what the citations look like and save them.

In [9]:
print(citations_db)

{'2001.03896': ['arXiv:1808.00773'], '2001.02756': ['arXiv:1811.03722', 'arXiv:quantph/0209083', 'arXiv:1712.02589', 'arXiv:quantph/0504051'], '1911.12231': ['arXiv:1809.07609', 'arXiv:1706.04702', 'arXiv:1905.05027', 'arXiv:1502.03167', 'arXiv:1412.6980', 'arXiv:1907.10578', 'arXiv:1807.06622', 'arXiv:1904.05921'], '1901.10393': ['arXiv:1909.00598', 'arXiv:1802.07075', 'arXiv:1809.02536', 'arXiv:1210.4034', 'arXiv:1409.2191', 'arXiv:1507.04951'], '1904.11854': ['arXiv:1604.08534'], '1712.08155': ['arXiv:nlin/0604022', 'arXiv:1706.02873', 'arXiv:1501.01955', 'arXiv:1008.1575'], '1907.08286': ['arXiv:1711.07866'], '2001.01916': ['arXiv:1603.04467', 'arXiv:1511.06435', 'arXiv:1512.01274', 'arXiv:1802.05799'], '2003.04995': ['arXiv:1801.10612'], '2002.06684': ['arXiv:1602.02672', 'arXiv:1805.08776', 'arXiv:1703.04908', 'arXiv:1812.09755'], '1902.08185': ['arXiv:1507.07956', 'arXiv:1301.0905', 'arXiv:1502.05314', 'arXiv:1809.09635', 'arXiv:1812.02714', 'arXiv:1509.05644', 'arXiv:1509.06676

In [10]:
# Save the set of citations
with open(Config.citations_db,'wb') as file:
    pickle.dump(citations_db,file)

#### We can now download the articles. To do this we will slightly tweak the script download_to_text.py.
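#### A hedged sketch of the mapping from a citation string to a local file name and a download URL, assuming well-formed citations such as `arXiv:1808.00773` or `arXiv:hep-th/9901001v2`. It differs from the function in the next cell in that only the first colon is treated as a separator and only a trailing version tag is stripped; `citation_to_id_url` is a hypothetical name.

```python
import re

def citation_to_id_url(citation):
    """Map e.g. 'arXiv:1808.00773' to ('1808.00773', download URL)."""
    # Split on the first colon only, so that a stray second colon cannot corrupt the URL.
    _, _, arxiv_id = citation.partition(':')
    # File name: drop a trailing version tag such as 'v2' and the archive prefix of old-style ids.
    idx = re.sub(r'v\d+$', '', arxiv_id).rsplit('/', 1)[-1]
    url = 'http://export.arxiv.org/pdf/' + arxiv_id + '.pdf'
    return idx, url

for citation in ['arXiv:1808.00773', 'arXiv:hep-th/9901001v2']:
    print(citation_to_id_url(citation))
```

#### As in the notebook, the archive prefix of old-style identifiers is dropped from the file name, so two old-style articles from different archives with the same number would map to the same text file.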

In [13]:
def get_id_url(citation):
    idx=citation[citation.rfind(':')+1:].split('v')[0]
    control=idx.rfind('/')
    if control!=-1:
        idx=idx[control+1:]
    url='http://export.'+citation.replace(':','.org/pdf/')+'.pdf'
    return idx,url.lower()

os.makedirs(Config.tmp,exist_ok=True) #create the directory that temporarily stores the pdfs, if not already present

timeout=10 #seconds to wait before aborting the download
already_have = set(os.listdir(Config.txt_db)) #names of the text files already present in the directory

num_to_add=0
num_added=0
with open(Config.citations_db,'rb') as file: #same path used when the citations were saved
    citations_db=pickle.load(file)
    
print('Starting the script')

for citations in citations_db.values():
    
    for citation in citations:
        idx,pdf_url=get_id_url(citation)
        txt=idx+'.txt'
        pdf=idx+'.pdf'
        pdf_path=os.path.join(Config.tmp,pdf)
        txt_path=os.path.join(Config.txt_db,txt)
        try:
            if txt not in already_have:
                num_to_add+=1
                req = urlopen(pdf_url, None, timeout)
                print('Getting article %s' % (pdf_url))
                with open(pdf_path, 'wb') as file:
                    shutil.copyfileobj(req, file)
                #converting the pdf into txt needs pdftotext on the system to run
                cmd = "pdftotext %s %s" % (pdf_path, txt_path)
                status=os.system(cmd) #named status to avoid shadowing the built-in exit
                #remove the pdf to save space
                os.remove(pdf_path)
                #check that everything went well
                if status==0:
                    num_added+=1
                else:
                    print('It seems like there was an error in converting %s. Please try again later. Exit status %i.'%(pdf,status))
                    #remove the article in case the file was created
                    if os.path.isfile(txt_path):
                        os.remove(txt_path)
            
            else:
                print('%s already exists, skipping.' % (idx))
    
        except Exception as e:
            print('An error incurred while downloading: %s .'%(pdf_url))
            print(e)
print('Downloaded %i articles out of %i.'%(num_added,num_to_add))    

Starting the script
1710.01673 already exists, skipping.
1801.09003 already exists, skipping.
1901.04385 already exists, skipping.
1412.6980 already exists, skipping.
1902.04367 already exists, skipping.
1905.09474 already exists, skipping.
1809.10716 already exists, skipping.
1905.04852 already exists, skipping.
1801.06416 already exists, skipping.
1904.12442 already exists, skipping.
1905.05371 already exists, skipping.
1708.08796 already exists, skipping.
1812.08486 already exists, skipping.
1904.08351 already exists, skipping.
1702.06579 already exists, skipping.
1809.10612 already exists, skipping.
1801.07200 already exists, skipping.
1812.11143 already exists, skipping.
0812.5023 already exists, skipping.
1607.03470 already exists, skipping.
1703.09312 already exists, skipping.
0509602 already exists, skipping.
1603.04467 already exists, skipping.
1412.6980 already exists, skipping.
1701.01687 already exists, skipping.
1711.02488 already exists, skipping.
1711.00591 already exist

Getting article http://export.arxiv.org/pdf/gr-qc/0410093.pdf
It seems like there was an error in converting 0410093.pdf. Please try again later. Exit status 256.
1910.09883 already exists, skipping.
1709.05653 already exists, skipping.
0708.0034 already exists, skipping.
1902.01430 already exists, skipping.
1812.09742 already exists, skipping.
1706.04158 already exists, skipping.
1511.01071 already exists, skipping.
1701.01040 already exists, skipping.
1611.01087 already exists, skipping.
1406.4513 already exists, skipping.
1704.03578 already exists, skipping.
1203.6543 already exists, skipping.
1707.06453 already exists, skipping.
0807.3243 already exists, skipping.
1201.4330 already exists, skipping.
1204.2314 already exists, skipping.
1206.4740 already exists, skipping.
1602.02410 already exists, skipping.
1606.04934 already exists, skipping.
1701.08155 already exists, skipping.
An error incurred while downloading: http://export.arxiv.org/pdf/hep-ph/0801.pdf .
HTTP Error 404: Not F

An error incurred while downloading: http://export.arxiv.org/pdf/180506084.pdf .
HTTP Error 404: Not Found
An error incurred while downloading: http://export.arxiv.org/pdf/181211699.pdf .
HTTP Error 404: Not Found
1812.10226 already exists, skipping.
1803.07177 already exists, skipping.
1903.03427 already exists, skipping.
1807.00652 already exists, skipping.
1412.6980 already exists, skipping.
1801.07791 already exists, skipping.
1612.00593 already exists, skipping.
1802.09987 already exists, skipping.
1804.01654 already exists, skipping.
1607.07680 already exists, skipping.
1912.04312 already exists, skipping.
1601.03532 already exists, skipping.
0706.0622 already exists, skipping.
1411.2909 already exists, skipping.
1501.02188 already exists, skipping.
1701.00037 already exists, skipping.
1411.2030 already exists, skipping.
1807.08313 already exists, skipping.
1904.06553 already exists, skipping.
0404008 already exists, skipping.
0405125 already exists, skipping.
9204099 already exi

An error incurred while downloading: http://export.arxiv.org/pdf/1909.org/pdf/13837.pdf .
HTTP Error 404: Not Found
An error incurred while downloading: http://export.arxiv.org/pdf/1909.org/pdf/13837.pdf .
HTTP Error 404: Not Found
1702.07779 already exists, skipping.
1803.10829 already exists, skipping.
1810.06933 already exists, skipping.
1804.04794 already exists, skipping.
1604.00460 already exists, skipping.
1611.02675 already exists, skipping.
1105.1827 already exists, skipping.
1507.01988 already exists, skipping.
1404.2761 already exists, skipping.
1407.0334 already exists, skipping.
1811.04942 already exists, skipping.
1901.09925 already exists, skipping.
0408151 already exists, skipping.
1809.05523 already exists, skipping.
2001.05109 already exists, skipping.
1911.04709 already exists, skipping.
1912.08379 already exists, skipping.
1910.11343 already exists, skipping.
2001.10275 already exists, skipping.
1808.06670 already exists, skipping.
1502.03167 already exists, skippin

An error incurred while downloading: http://export.arxiv.org/pdf/1907.pdf .
HTTP Error 404: Not Found
1611.05243 already exists, skipping.
1611.09806 already exists, skipping.
1908.03969 already exists, skipping.
1909.07253 already exists, skipping.
1910.01395 already exists, skipping.
1811.11479 already exists, skipping.
1812.01336 already exists, skipping.
1804.08367 already exists, skipping.
1804.08367 already exists, skipping.
1804.08367 already exists, skipping.
1612.07618 already exists, skipping.
1810.10971 already exists, skipping.
1410.3394 already exists, skipping.
1905.00728 already exists, skipping.
1905.00711 already exists, skipping.
1809.09466 already exists, skipping.
1602.04946 already exists, skipping.
1610.05131 already exists, skipping.
1605.01936 already exists, skipping.
1207.7214 already exists, skipping.
1207.7235 already exists, skipping.
1303.4571 already exists, skipping.
1503.07589 already exists, skipping.
1706.09936 already exists, skipping.
0207010 alread

An error incurred while downloading: http://export.arxiv.org/pdf/1406.26369.pdf .
HTTP Error 404: Not Found
1906.11988 already exists, skipping.
1410.3831 already exists, skipping.
1910.09658 already exists, skipping.
2003.03777 already exists, skipping.
2002.01038 already exists, skipping.
2001.07620 already exists, skipping.
1905.04497 already exists, skipping.
1511.06434 already exists, skipping.
1701.07875 already exists, skipping.
1710.10196 already exists, skipping.
1810.10863 already exists, skipping.
1409.1556 already exists, skipping.
1607.08022 already exists, skipping.
1412.6980 already exists, skipping.
1604.02768 already exists, skipping.
1703.03947 already exists, skipping.
1911.08266 already exists, skipping.
1711.08395 already exists, skipping.
1711.03145 already exists, skipping.
1607.05756 already exists, skipping.
1607.07884 already exists, skipping.
1707.05362 already exists, skipping.
1707.05181 already exists, skipping.
1810.00149 already exists, skipping.
1810.12

An error incurred while downloading: http://export.arxiv.org/pdf/170304093.pdf .
HTTP Error 404: Not Found
1311.0208 already exists, skipping.
1610.04837 already exists, skipping.
1609.01891 already exists, skipping.
1011.1690 already exists, skipping.
1809.00913 already exists, skipping.
1711.03213 already exists, skipping.
1612.02649 already exists, skipping.
1709.01507 already exists, skipping.
1603.04779 already exists, skipping.
1711.06969 already exists, skipping.
1412.3474 already exists, skipping.
1412.4869 already exists, skipping.
1407.3490 already exists, skipping.
1504.05063 already exists, skipping.
9902056 already exists, skipping.
9902015 already exists, skipping.
0008064 already exists, skipping.
0203055 already exists, skipping.
0904.2227 already exists, skipping.
0208023 already exists, skipping.
0603045 already exists, skipping.
1610.04545 already exists, skipping.
9509371 already exists, skipping.
1510.07598 already exists, skipping.
1805.06056 already exists, skipp

An error incurred while downloading: http://export.arxiv.org/pdf/170301777.pdf .
HTTP Error 404: Not Found
1802.01561 already exists, skipping.
1902.06865 already exists, skipping.
1611.05397 already exists, skipping.
1910.07478 already exists, skipping.
1912.06910 already exists, skipping.
1909.11583 already exists, skipping.
1911.08265 already exists, skipping.
1711.09846 already exists, skipping.
1505.00853 already exists, skipping.
1612.09465 already exists, skipping.
1805.04514 already exists, skipping.
1409.8188 already exists, skipping.
0711.0149 already exists, skipping.
1905.06758 already exists, skipping.
1409.1556 already exists, skipping.
An error incurred while downloading: http://export.arxiv.org/pdf/arxiv.org/pdf/1602.pdf .
HTTP Error 404: Not Found
1904.01099 already exists, skipping.
1510.00149 already exists, skipping.
1409.1556 already exists, skipping.
1703.07737 already exists, skipping.
1811.00626 already exists, skipping.
1705.07771 already exists, skipping.
1612

An error incurred while downloading: http://export.arxiv.org/pdf/14104771.pdf .
HTTP Error 404: Not Found
1504.03145 already exists, skipping.
1604.07316 already exists, skipping.
1803.09156 already exists, skipping.
1806.00667 already exists, skipping.
1707.05572 already exists, skipping.
1905.03828 already exists, skipping.
1908.03173 already exists, skipping.
1707.05373 already exists, skipping.
1801.01944 already exists, skipping.
1412.5567 already exists, skipping.
1801.00554 already exists, skipping.
1607.02533 already exists, skipping.
1804.03209 already exists, skipping.
1412.6572 already exists, skipping.
1905.02175 already exists, skipping.
1705.09554 already exists, skipping.
1701.04862 already exists, skipping.
1701.07875 already exists, skipping.
1805.06576 already exists, skipping.
1808.10356 already exists, skipping.
1803.05649 already exists, skipping.
1803.07819 already exists, skipping.
1707.05776 already exists, skipping.
1804.00891 already exists, skipping.
1410.851

An error incurred while downloading: http://export.arxiv.org/pdf/hep-lat/1111.pdf .
HTTP Error 404: Not Found
1803.07081 already exists, skipping.
0804.2509 already exists, skipping.
1812.00047 already exists, skipping.
1806.10528 already exists, skipping.
1812.04091 already exists, skipping.
1709.08763 already exists, skipping.
1906.05178 already exists, skipping.
1711.08013 already exists, skipping.
1807.08086 already exists, skipping.
1510.07291 already exists, skipping.
1905.06753 already exists, skipping.
1912.02846 already exists, skipping.
2001.00661 already exists, skipping.
1511.04297 already exists, skipping.
1710.05605 already exists, skipping.
1711.09908 already exists, skipping.
1604.01356 already exists, skipping.
1605.02760 already exists, skipping.
1801.05814 already exists, skipping.
1804.04463 already exists, skipping.
1811.04456 already exists, skipping.
1004.4959 already exists, skipping.
1309.5681 already exists, skipping.
1402.1558 already exists, skipping.
1709.0

An error incurred while downloading: http://export.arxiv.org/pdf/arxiv.org/pdf/1802.pdf .
HTTP Error 404: Not Found
1812.03712 already exists, skipping.
1803.04387 already exists, skipping.
1805.07988 already exists, skipping.
1302.5555 already exists, skipping.
1607.05188 already exists, skipping.
1905.00123 already exists, skipping.
An error incurred while downloading: http://export.arxiv.org/pdf/11907.02614.pdf .
HTTP Error 404: Not Found
1907.12559 already exists, skipping.
1910.00022 already exists, skipping.
1709.09066 already exists, skipping.
1903.04496 already exists, skipping.
1907.06440 already exists, skipping.
1807.06209 already exists, skipping.
1911.09614 already exists, skipping.
1907.10121 already exists, skipping.
1607.02173 already exists, skipping.
1711.00541 already exists, skipping.
1809.07454 already exists, skipping.
1705.02514 already exists, skipping.
1902.00651 already exists, skipping.
1902.04891 already exists, skipping.
1907.09884 already exists, skipping.

An error incurred while downloading: http://export.arxiv.org/pdf/arxiv.org/pdf/1409.pdf .
HTTP Error 404: Not Found
1807.06209 already exists, skipping.
2002.07179 already exists, skipping.
1810.13423 already exists, skipping.
1703.01369 already exists, skipping.
1709.05392 already exists, skipping.
1909.10354 already exists, skipping.
1909.06437 already exists, skipping.
1912.04392 already exists, skipping.
0806.3261 already exists, skipping.
9609018 already exists, skipping.
1211.0529 already exists, skipping.
1812.11933 already exists, skipping.
1006.1326 already exists, skipping.
1208.1500 already exists, skipping.
1209.2022 already exists, skipping.
1105.5048 already exists, skipping.
1903.05777 already exists, skipping.
1511.05226 already exists, skipping.
1512.04288 already exists, skipping.
1304.6141 already exists, skipping.
1808.00323 already exists, skipping.
1910.09740 already exists, skipping.
1807.11462 already exists, skipping.
1907.06173 already exists, skipping.
1812.0

An error incurred while downloading: http://export.arxiv.org/pdf/0411174.pdf .
HTTP Error 404: Not Found
1905.13712 already exists, skipping.
1910.10813 already exists, skipping.
1712.09119 already exists, skipping.
1903.10161 already exists, skipping.
1606.01332 already exists, skipping.
1910.06933 already exists, skipping.
1907.01681 already exists, skipping.
1805.04027 already exists, skipping.
1908.05996 already exists, skipping.
1903.08931 already exists, skipping.
1907.00575 already exists, skipping.
1906.06014 already exists, skipping.
0808.0017 already exists, skipping.
1905.02450 already exists, skipping.
1902.04864 already exists, skipping.
1908.09972 already exists, skipping.
1808.06414 already exists, skipping.
1607.06450 already exists, skipping.
1409.0473 already exists, skipping.
1810.04805 already exists, skipping.
1706.03847 already exists, skipping.
1511.06939 already exists, skipping.
1412.2007 already exists, skipping.
1610.10099 already exists, skipping.
1412.6980 

An error incurred while downloading: http://export.arxiv.org/pdf/0505265.pdf .
HTTP Error 404: Not Found
An error incurred while downloading: http://export.arxiv.org/pdf/hep-ph/1807.pdf .
HTTP Error 404: Not Found
1907.07100 already exists, skipping.
1904.10341 already exists, skipping.
1810.11909 already exists, skipping.
1506.02169 already exists, skipping.
1708.01974 already exists, skipping.
1406.2661 already exists, skipping.
1903.04057 already exists, skipping.
1810.06433 already exists, skipping.
1610.03483 already exists, skipping.
1805.07226 already exists, skipping.
1707.05987 already exists, skipping.
1912.05810 already exists, skipping.
1611.10242 already exists, skipping.
1909.03435 already exists, skipping.
1312.0544 already exists, skipping.
1608.02997 already exists, skipping.
1706.07628 already exists, skipping.
1810.10137 already exists, skipping.
1309.0260 already exists, skipping.
1703.05132 already exists, skipping.
1612.08138 already exists, skipping.
1609.02108 a

An error incurred while downloading: http://export.arxiv.org/pdf/1711.pdf .
HTTP Error 404: Not Found
0412012 already exists, skipping.
0710.3791 already exists, skipping.
1807.01924 already exists, skipping.
1806.07403 already exists, skipping.
1705.05802 already exists, skipping.
1512.00322 already exists, skipping.
9711021 already exists, skipping.
1103.2987 already exists, skipping.
1703.02508 already exists, skipping.
An error incurred while downloading: http://export.arxiv.org/pdf/1612.pdf .
HTTP Error 404: Not Found
0306138 already exists, skipping.
1406.6482 already exists, skipping.
1508.07749 already exists, skipping.
1703.05747 already exists, skipping.
1901.03584 already exists, skipping.
1705.07933 already exists, skipping.
1705.07935 already exists, skipping.
1705.07917 already exists, skipping.
1512.04442 already exists, skipping.
An error incurred while downloading: http://export.arxiv.org/pdf/1606.pdf .
HTTP Error 404: Not Found
1609.04736 already exists, skipping.
190

An error incurred while downloading: http://export.arxiv.org/pdf/0604447.pdf .
HTTP Error 404: Not Found
1908.05643 already exists, skipping.
1908.05643 already exists, skipping.
1607.08261 already exists, skipping.
1705.02387 already exists, skipping.
1308.6253 already exists, skipping.
1811.04968 already exists, skipping.
1607.08535 already exists, skipping.
1809.09697 already exists, skipping.
1304.3061 already exists, skipping.
1411.4028 already exists, skipping.
1512.01098 already exists, skipping.
1807.08768 already exists, skipping.
2001.04060 already exists, skipping.
1604.01401 already exists, skipping.
1304.3390 already exists, skipping.
1507.01902 already exists, skipping.
1612.08091 already exists, skipping.
1905.11349 already exists, skipping.
1901.11054 already exists, skipping.
1612.04929 already exists, skipping.
1909.07522 already exists, skipping.
1904.04735 already exists, skipping.
1710.07629 already exists, skipping.
1810.10523 already exists, skipping.
1911.04630 

An error incurred while downloading: http://export.arxiv.org/pdf/9903031v1.pdf .
HTTP Error 404: Not Found
1203.3195 already exists, skipping.
1712.01396 already exists, skipping.
1901.00507 already exists, skipping.
1709.01091 already exists, skipping.
1903.03973 already exists, skipping.
1705.06655 already exists, skipping.
1810.01295 already exists, skipping.
1709.01525 already exists, skipping.
1404.1949 already exists, skipping.
1410.2602 already exists, skipping.
1701.00037 already exists, skipping.
An error incurred while downloading: http://export.arxiv.org/pdf/9807001.pdf .
HTTP Error 404: Not Found
An error incurred while downloading: http://export.arxiv.org/pdf/9703084.pdf .
HTTP Error 404: Not Found
1908.01886 already exists, skipping.
1811.01260 already exists, skipping.
1904.07710 already exists, skipping.
0904.3575 already exists, skipping.
1910.11846 already exists, skipping.
1808.01559 already exists, skipping.
2002.09486 already exists, skipping.
2002.01676 already ex

An error incurred while downloading: http://export.arxiv.org/pdf/12116667.pdf .
HTTP Error 404: Not Found
1906.06162 already exists, skipping.
1606.05464 already exists, skipping.
1607.01759 already exists, skipping.
1412.6980 already exists, skipping.
1403.7027 already exists, skipping.
2001.05777 already exists, skipping.
1307.5458 already exists, skipping.
1409.0050 already exists, skipping.
1606.09243 already exists, skipping.
0903.3630 already exists, skipping.
1901.08876 already exists, skipping.
1606.07001 already exists, skipping.
1506.08309 already exists, skipping.
1607.01468 already exists, skipping.
1602.05300 already exists, skipping.
2002.07499 already exists, skipping.
1807.07169 already exists, skipping.
1106.1613 already exists, skipping.
0207035 already exists, skipping.
1102.2688 already exists, skipping.
1510.08127 already exists, skipping.
1507.05613 already exists, skipping.
0903.5323 already exists, skipping.
1805.12562 already exists, skipping.
1708.06917 alread

An error incurred while downloading: http://export.arxiv.org/pdf/arxiv.org/pdf/1011.pdf .
HTTP Error 404: Not Found
An error incurred while downloading: http://export.arxiv.org/pdf/0103615.pdf .
HTTP Error 404: Not Found
An error incurred while downloading: http://export.arxiv.org/pdf/10.1021.pdf .
HTTP Error 404: Not Found
1901.05033 already exists, skipping.
1712.00669 already exists, skipping.
1804.07904 already exists, skipping.
1909.04510 already exists, skipping.
1011.3914 already exists, skipping.
0706.3356 already exists, skipping.
1806.08708 already exists, skipping.
1806.02295 already exists, skipping.
1910.12400 already exists, skipping.
1909.02508 already exists, skipping.
1811.05519 already exists, skipping.
1812.02820 already exists, skipping.
1906.11413 already exists, skipping.
1911.00237 already exists, skipping.
1911.08507 already exists, skipping.
1711.03145 already exists, skipping.
1707.05181 already exists, skipping.
1707.05362 already exists, skipping.
1705.07570

An error incurred while downloading: http://export.arxiv.org/pdf/9405016v1.pdf .
HTTP Error 404: Not Found
An error incurred while downloading: http://export.arxiv.org/pdf/9308022.pdf .
HTTP Error 404: Not Found
1706.06223 already exists, skipping.
1710.03529 already exists, skipping.
1910.11745 already exists, skipping.
1804.03528 already exists, skipping.
1912.01600 already exists, skipping.
1912.09872 already exists, skipping.
1704.02518 already exists, skipping.
1712.05969 already exists, skipping.
1603.04467 already exists, skipping.
1412.6980 already exists, skipping.
1510.04430 already exists, skipping.
0701193 already exists, skipping.
1301.0537 already exists, skipping.
1404.7736 already exists, skipping.
1907.06664 already exists, skipping.
1906.04090 already exists, skipping.
1903.12546 already exists, skipping.
1911.12461 already exists, skipping.
1909.04337 already exists, skipping.
1909.00266 already exists, skipping.
1906.07603 already exists, skipping.
1910.04126 alread

An error incurred while downloading: http://export.arxiv.org/pdf/1803.pdf .
HTTP Error 404: Not Found
1901.05184 already exists, skipping.
1803.00652 already exists, skipping.
1809.05513 already exists, skipping.
1911.09289 already exists, skipping.
1911.07705 already exists, skipping.
1705.08926 already exists, skipping.
1711.03817 already exists, skipping.
1802.06444 already exists, skipping.
1803.11485 already exists, skipping.
1905.07574 already exists, skipping.
1412.7584 already exists, skipping.
1412.7584 already exists, skipping.
1904.10899 already exists, skipping.
1905.07279 already exists, skipping.
1612.06926 already exists, skipping.
1608.06279 already exists, skipping.
1702.07513 already exists, skipping.
1102.0647 already exists, skipping.
1608.04121 already exists, skipping.
1702.02315 already exists, skipping.
1103.1479 already exists, skipping.
0911.3972 already exists, skipping.
1905.11083 already exists, skipping.
1909.12817 already exists, skipping.
1201.4145 alrea

An error incurred while downloading: http://export.arxiv.org/pdf/14075001.pdf .
HTTP Error 404: Not Found
1712.03853 already exists, skipping.
1606.04855 already exists, skipping.
1709.09660 already exists, skipping.
1710.05832 already exists, skipping.
1807.00388 already exists, skipping.
1906.05673 already exists, skipping.
1908.04261 already exists, skipping.
1912.05194 already exists, skipping.
2001.03777 already exists, skipping.
2001.03777 already exists, skipping.
1107.0685 already exists, skipping.
1110.6145 already exists, skipping.
1705.00736 already exists, skipping.
1210.4664 already exists, skipping.
1309.3219 already exists, skipping.
0206094 already exists, skipping.
1407.6735 already exists, skipping.
1406.1744 already exists, skipping.
0404003 already exists, skipping.
9606010 already exists, skipping.
9702015 already exists, skipping.
0504437 already exists, skipping.
9709040 already exists, skipping.
0011041 already exists, skipping.
1612.02254 already exists, skippi

An error incurred while downloading: http://export.arxiv.org/pdf/1809.pdf .
HTTP Error 404: Not Found
1803.04247 already exists, skipping.
1912.13483 already exists, skipping.
1708.05574 already exists, skipping.
1908.05574 already exists, skipping.
1810.13134 already exists, skipping.
1907.02671 already exists, skipping.
1907.02671 already exists, skipping.
0708.1575 already exists, skipping.
1702.05466 already exists, skipping.
1603.08472 already exists, skipping.
1812.00366 already exists, skipping.
1907.09740 already exists, skipping.
1603.08472 already exists, skipping.
1903.03615 already exists, skipping.
1810.00516 already exists, skipping.
1412.6980 already exists, skipping.
1812.01739 already exists, skipping.
1903.06379 already exists, skipping.
1609.08675 already exists, skipping.
1703.07737 already exists, skipping.
1502.03167 already exists, skipping.
1304.5634 already exists, skipping.
1709.05584 already exists, skipping.
1412.6980 already exists, skipping.
1611.07308 alr

An error incurred while downloading: http://export.arxiv.org/pdf/1901.075041.pdf .
HTTP Error 404: Not Found
1904.01058 already exists, skipping.
1905.09822 already exists, skipping.
1810.05601 already exists, skipping.
1912.09961 already exists, skipping.
1905.06295 already exists, skipping.
2002.00869 already exists, skipping.
1209.5145 already exists, skipping.
1810.06175 already exists, skipping.
1906.11812 already exists, skipping.
1811.02699 already exists, skipping.
1910.07827 already exists, skipping.
1509.02417 already exists, skipping.
1912.07411 already exists, skipping.
1612.05560 already exists, skipping.
1410.2596 already exists, skipping.
1601.04790 already exists, skipping.
1910.07140 already exists, skipping.
9312193 already exists, skipping.
9802054 already exists, skipping.
1611.02813 already exists, skipping.
1803.04212 already exists, skipping.
1605.04777 already exists, skipping.
1509.02869 already exists, skipping.
1809.08238 already exists, skipping.
1911.01020 

An error incurred while downloading: http://export.arxiv.org/pdf/0601121.pdf .
HTTP Error 404: Not Found
1610.02208 already exists, skipping.
1709.06678 already exists, skipping.
1104.3835 already exists, skipping.
0810.2866 already exists, skipping.
0812.3510 already exists, skipping.
1002.1330 already exists, skipping.
1401.5780 already exists, skipping.
1505.00665 already exists, skipping.
1011.1669 already exists, skipping.
1609.09446 already exists, skipping.
1610.08841 already exists, skipping.
1207.1655 already exists, skipping.
1309.0876 already exists, skipping.
1311.5269 already exists, skipping.
1409.1524 already exists, skipping.
1703.05402 already exists, skipping.
1410.3029 already exists, skipping.
1612.05204 already exists, skipping.
1712.01850 already exists, skipping.
1802.01590 already exists, skipping.
1802.07827 already exists, skipping.
1803.11278 already exists, skipping.
1803.11278 already exists, skipping.
1807.06113 already exists, skipping.
1807.04564 already

An error incurred while downloading: http://export.arxiv.org/pdf/1812.045555.pdf .
HTTP Error 404: Not Found
1908.03714 already exists, skipping.
1611.07120 already exists, skipping.
1301.3781 already exists, skipping.
1607.05368 already exists, skipping.
1703.06587 already exists, skipping.
1707.05005 already exists, skipping.
1606.08928 already exists, skipping.
1605.03481 already exists, skipping.
1707.04596 already exists, skipping.
1801.06597 already exists, skipping.
1503.00075 already exists, skipping.
2002.00697 already exists, skipping.
1405.1956 already exists, skipping.
1405.1955 already exists, skipping.
1708.04853 already exists, skipping.
2002.03395 already exists, skipping.
0503224 already exists, skipping.
1910.12565 already exists, skipping.
1901.00144 already exists, skipping.
1111.1797 already exists, skipping.
1707.02038 already exists, skipping.
1808.07371 already exists, skipping.
1708.08649 already exists, skipping.
1806.03856 already exists, skipping.
1803.06917

An error incurred while downloading: http://export.arxiv.org/pdf/0306026.pdf .
HTTP Error 404: Not Found
1912.00395 already exists, skipping.
2001.09875 already exists, skipping.
1801.04077 already exists, skipping.
2002.02332 already exists, skipping.
9911008 already exists, skipping.
0206340 already exists, skipping.
1710.06326 already exists, skipping.
1211.5199 already exists, skipping.
1510.06871 already exists, skipping.
1909.04679 already exists, skipping.
1909.04039 already exists, skipping.
2001.06009 already exists, skipping.
2002.04340 already exists, skipping.
1909.08924 already exists, skipping.
1910.03590 already exists, skipping.
1909.04667 already exists, skipping.
1603.01171 already exists, skipping.
1903.10792 already exists, skipping.
1802.04123 already exists, skipping.
1810.12952 already exists, skipping.
1808.08760 already exists, skipping.
1806.04617 already exists, skipping.
2001.06430 already exists, skipping.
1811.02573 already exists, skipping.
1911.00380 alr

An error incurred while downloading: http://export.arxiv.org/pdf/1506.pdf .
HTTP Error 404: Not Found
2001.05997 already exists, skipping.
An error incurred while downloading: http://export.arxiv.org/pdf/dg-ga/961201.pdf .
HTTP Error 404: Not Found
1608.03355 already exists, skipping.
1703.05132 already exists, skipping.
1608.07158 already exists, skipping.
1811.03859 already exists, skipping.
1604.05932 already exists, skipping.
1712.05809 already exists, skipping.
1810.13306 already exists, skipping.
1909.09157 already exists, skipping.
1911.09103 already exists, skipping.
1908.09487 already exists, skipping.
1810.03251 already exists, skipping.
1408.5385 already exists, skipping.
1801.08509 already exists, skipping.
1612.02650 already exists, skipping.
1907.07102 already exists, skipping.
1711.03088 already exists, skipping.
1801.01371 already exists, skipping.
1904.13116 already exists, skipping.
1710.05528 already exists, skipping.
1802.04865 already exists, skipping.
1805.09155 a

An error incurred while downloading: http://export.arxiv.org/pdf/abs/1705.pdf .
HTTP Error 404: Not Found
1901.07014 already exists, skipping.
1607.06450 already exists, skipping.
1409.0473 already exists, skipping.
1712.05690 already exists, skipping.
1706.06415 already exists, skipping.
1409.0473 already exists, skipping.
1406.1078 already exists, skipping.
1703.06846 already exists, skipping.
1308.0850 already exists, skipping.
1602.06662 already exists, skipping.
1612.05231 already exists, skipping.
1710.09431 already exists, skipping.
1312.6026 already exists, skipping.
1509.08101 already exists, skipping.
1504.00941 already exists, skipping.
1804.09737 already exists, skipping.
1804.09737 already exists, skipping.
1803.10425 already exists, skipping.
1803.03841 already exists, skipping.
1712.06836 already exists, skipping.
1812.05090 already exists, skipping.
1907.00869 already exists, skipping.
1907.00869 already exists, skipping.
1111.6580 already exists, skipping.
9403051 alre

An error incurred while downloading: http://export.arxiv.org/pdf/170103341 2017.pdf .
HTTP Error 404: Not Found
An error incurred while downloading: http://export.arxiv.org/pdf/160400772 2016.pdf .
HTTP Error 404: Not Found
1703.01369 already exists, skipping.
2001.06885 already exists, skipping.
2002.07148 already exists, skipping.
2002.10244 already exists, skipping.
2001.05792 already exists, skipping.
0302399 already exists, skipping.
1910.05643 already exists, skipping.
1512.07820 already exists, skipping.
1610.03361 already exists, skipping.
1707.04966 already exists, skipping.
0208016 already exists, skipping.
1603.02698 already exists, skipping.
1604.03894 already exists, skipping.
1010.5788 already exists, skipping.
1304.6875 already exists, skipping.
1904.06759 already exists, skipping.
1411.4547 already exists, skipping.
1408.3978 already exists, skipping.
0901.3258 already exists, skipping.
0911.3535 already exists, skipping.
1410.8866 already exists, skipping.
1307.8338 al

An error incurred while downloading: http://export.arxiv.org/pdf/0000.0000.pdf .
HTTP Error 404: Not Found
2002.06503 already exists, skipping.
1812.08886 already exists, skipping.
1709.09657 already exists, skipping.
1306.6046 already exists, skipping.
0609140 already exists, skipping.
0308187 already exists, skipping.
0207107 already exists, skipping.
1501.00401 already exists, skipping.
9801088 already exists, skipping.
1009.5881 already exists, skipping.
1111.0910 already exists, skipping.
1306.3763 already exists, skipping.
1710.04408 already exists, skipping.
0912.5104 already exists, skipping.
1205.0975 already exists, skipping.
1301.6872 already exists, skipping.
1312.5729 already exists, skipping.
1510.05949 already exists, skipping.
1712.02280 already exists, skipping.
1811.11094 already exists, skipping.
1412.6428 already exists, skipping.
1511.02428 already exists, skipping.
1512.00815 already exists, skipping.
9912232 already exists, skipping.
1003.3953 already exists, ski

An error incurred while downloading: http://export.arxiv.org/pdf/1703.pdf .
HTTP Error 404: Not Found
1209.4505 already exists, skipping.
1802.07203 already exists, skipping.
0907.4219 already exists, skipping.
1503.07631 already exists, skipping.
1704.01848 already exists, skipping.
1405.2247 already exists, skipping.
1201.3565 already exists, skipping.
1804.02044 already exists, skipping.
1608.01304 already exists, skipping.
1608.02495 already exists, skipping.
9606040 already exists, skipping.
1909.02296 already exists, skipping.
1706.07496 already exists, skipping.
1708.09837 already exists, skipping.
0611014 already exists, skipping.
1407.6387 already exists, skipping.
1309.5258 already exists, skipping.
1504.05274 already exists, skipping.
1701.02105 already exists, skipping.
0204010 already exists, skipping.
0303013 already exists, skipping.
2001.02852 already exists, skipping.
0710.0998 already exists, skipping.
1701.03548 already exists, skipping.
2001.04398 already exists, sk

An error incurred while downloading: http://export.arxiv.org/pdf/arxiv.org/pdf/1207.pdf .
HTTP Error 404: Not Found
1805.00133 already exists, skipping.
1909.09733 already exists, skipping.
1509.02971 already exists, skipping.
2002.09012 already exists, skipping.
1805.04814 already exists, skipping.
1609.06534 already exists, skipping.
1605.04886 already exists, skipping.
1606.03625 already exists, skipping.
0807.0715 already exists, skipping.
1902.07117 already exists, skipping.
1105.1160 already exists, skipping.
1005.3508 already exists, skipping.
1111.1710 already exists, skipping.
1402.5175 already exists, skipping.
1510.00442 already exists, skipping.
1507.01558 already exists, skipping.
1405.3940 already exists, skipping.
1509.07324 already exists, skipping.
0605209 already exists, skipping.
0612182 already exists, skipping.
1509.07758 already exists, skipping.
1805.04403 already exists, skipping.
1104.3708 already exists, skipping.
1604.07544 already exists, skipping.
1310.0164

An error incurred while downloading: http://export.arxiv.org/pdf/171004038.pdf .
HTTP Error 404: Not Found
1905.06174 already exists, skipping.
1906.06356 already exists, skipping.
1712.01263 already exists, skipping.
1901.06758 already exists, skipping.
1505.05256 already exists, skipping.
1702.06371 already exists, skipping.
1808.10378 already exists, skipping.
1805.05347 already exists, skipping.
1409.3085 already exists, skipping.
1804.01505 already exists, skipping.
1404.7115 already exists, skipping.
1111.3633 already exists, skipping.
1903.08807 already exists, skipping.
1802.06704 already exists, skipping.
1605.04570 already exists, skipping.
1811.12332 already exists, skipping.
1903.06577 already exists, skipping.
1612.04753 already exists, skipping.
1705.10359 already exists, skipping.
1901.09970 already exists, skipping.
1804.04272 already exists, skipping.
2003.09804 already exists, skipping.
1512.04611 already exists, skipping.
1907.06624 already exists, skipping.
1809.062

An error incurred while downloading: http://export.arxiv.org/pdf/1808.org/pdf/00776.pdf .
HTTP Error 404: Not Found
1312.4400 already exists, skipping.
1711.05225 already exists, skipping.
1412.6856 already exists, skipping.
1504.08045 already exists, skipping.
1108.2492 already exists, skipping.
1809.06932 already exists, skipping.
1705.01054 already exists, skipping.
1211.6036 already exists, skipping.
1702.06109 already exists, skipping.
1510.02284 already exists, skipping.
1107.1344 already exists, skipping.
0204005 already exists, skipping.
1707.05198 already exists, skipping.
1806.05545 already exists, skipping.
1912.13262 already exists, skipping.
2002.06413 already exists, skipping.
1908.04316 already exists, skipping.
1710.08404 already exists, skipping.
1911.00838 already exists, skipping.
1908.01458 already exists, skipping.
1911.00837 already exists, skipping.
1101.1866 already exists, skipping.
1905.08229 already exists, skipping.
1703.07842 already exists, skipping.
1905.

An error incurred while downloading: http://export.arxiv.org/pdf/math/1302.pdf .
HTTP Error 404: Not Found
An error incurred while downloading: http://export.arxiv.org/pdf/math/1009.pdf .
HTTP Error 404: Not Found
0308116 already exists, skipping.
An error incurred while downloading: http://export.arxiv.org/pdf/math/1104.pdf .
HTTP Error 404: Not Found
2001.00224 already exists, skipping.
1603.02754 already exists, skipping.
1612.05065 already exists, skipping.
1110.6357 already exists, skipping.
1902.05016 already exists, skipping.
1909.00761 already exists, skipping.
1908.07305 already exists, skipping.
1902.02854 already exists, skipping.
1809.02152 already exists, skipping.
1804.03395 already exists, skipping.
1811.12034 already exists, skipping.
1906.08130 already exists, skipping.
1909.00627 already exists, skipping.
0910.2778 already exists, skipping.
2001.03219 already exists, skipping.
2001.01270 already exists, skipping.
1910.07046 already exists, skipping.
1809.09337 already

An error incurred while downloading: http://export.arxiv.org/pdf/190505476.pdf .
HTTP Error 404: Not Found
1903.11814 already exists, skipping.
1904.03979 already exists, skipping.
1808.06611 already exists, skipping.
1305.7192 already exists, skipping.
1911.05219 already exists, skipping.
1903.11169 already exists, skipping.
1504.06847 already exists, skipping.
1808.01400 already exists, skipping.
1705.09231 already exists, skipping.
1607.06450 already exists, skipping.
1409.0473 already exists, skipping.
1602.02218 already exists, skipping.
1611.01989 already exists, skipping.
1707.02275 already exists, skipping.
1809.05193 already exists, skipping.
1705.07962 already exists, skipping.
1603.06129 already exists, skipping.
1611.08307 already exists, skipping.
1607.04606 already exists, skipping.
1406.2751 already exists, skipping.
1611.01576 already exists, skipping.
1805.08490 already exists, skipping.
1509.00519 already exists, skipping.
1704.06611 already exists, skipping.
1901.018

An error incurred while downloading: http://export.arxiv.org/pdf/arxiv.org/pdf/1806.pdf .
HTTP Error 404: Not Found
1907.11793 already exists, skipping.
1811.08839 already exists, skipping.
1903.01047 already exists, skipping.
1907.11422 already exists, skipping.
2002.07152 already exists, skipping.
1905.07468 already exists, skipping.
1905.02592 already exists, skipping.
2002.07152 already exists, skipping.
1905.00536 already exists, skipping.
2001.07741 already exists, skipping.
1812.01602 already exists, skipping.
1905.07468 already exists, skipping.
1905.02592 already exists, skipping.
1907.11422 already exists, skipping.
2001.07477 already exists, skipping.
1903.01047 already exists, skipping.
1501.01579 already exists, skipping.
1712.06128 already exists, skipping.
1706.08728 already exists, skipping.
1804.08870 already exists, skipping.
1904.00121 already exists, skipping.
1904.00121 already exists, skipping.
1503.07224 already exists, skipping.
1611.01845 already exists, skippi

Getting article http://export.arxiv.org/pdf/1809.05828.pdf
It seems like there was an error in converting 1809.05828.pdf. Please try again later. Exit status 256.
2002.00661 already exists, skipping.
1806.10729 already exists, skipping.
1611.05397 already exists, skipping.
1611.03673 already exists, skipping.
1807.11916 already exists, skipping.
1712.10321 already exists, skipping.
1101.2599 already exists, skipping.
1109.0887 already exists, skipping.
1903.06246 already exists, skipping.
1412.6980 already exists, skipping.
1506.01186 already exists, skipping.
1604.06737 already exists, skipping.
1206.5533 already exists, skipping.
1512.03385 already exists, skipping.
1608.03983 already exists, skipping.
1704.00109 already exists, skipping.
1710.05941 already exists, skipping.
1802.10026 already exists, skipping.
1803.00885 already exists, skipping.
1803.05407 already exists, skipping.
1708.07120 already exists, skipping.
1803.09820 already exists, skipping.
1608.06993 already exists, 

An error incurred while downloading: http://export.arxiv.org/pdf/0512608.pdf .
HTTP Error 404: Not Found
1909.13847 already exists, skipping.
2001.06539 already exists, skipping.
1711.00839 already exists, skipping.
1808.00848 already exists, skipping.
1912.07529 already exists, skipping.
1912.08232 already exists, skipping.
1609.03499 already exists, skipping.
1612.07837 already exists, skipping.
1802.08435 already exists, skipping.
1711.10433 already exists, skipping.
1807.07281 already exists, skipping.
1811.00002 already exists, skipping.
1811.06292 already exists, skipping.
1811.03021 already exists, skipping.
1804.09593 already exists, skipping.
1811.11913 already exists, skipping.
1801.04406 already exists, skipping.
1412.6980 already exists, skipping.
1603.04467 already exists, skipping.
1904.12088 already exists, skipping.
1910.09076 already exists, skipping.
1906.07694 already exists, skipping.
1906.07696 already exists, skipping.
1802.01534 already exists, skipping.
1812.081

Getting article http://export.arxiv.org/pdf/1903.03107.pdf
It seems like there was an error in converting 1903.03107.pdf. Please try again later. Exit status 256.
1705.09792 already exists, skipping.
1412.6980 already exists, skipping.
1609.03499 already exists, skipping.
1907.06405 already exists, skipping.
1602.02326 already exists, skipping.
1604.02019 already exists, skipping.
1411.4317 already exists, skipping.
1405.7033 already exists, skipping.
1409.1556 already exists, skipping.
1611.01578 already exists, skipping.
1412.6980 already exists, skipping.
1502.03167 already exists, skipping.
1802.01548 already exists, skipping.
1903.03893 already exists, skipping.
1711.04528 already exists, skipping.
1711.00436 already exists, skipping.
1605.07648 already exists, skipping.
1312.4400 already exists, skipping.
1605.07146 already exists, skipping.
1912.02771 already exists, skipping.
1905.13409 already exists, skipping.
1811.03728 already exists, skipping.
2002.08313 already exists, sk

An error incurred while downloading: http://export.arxiv.org/pdf/190510866 2019.pdf .
HTTP Error 404: Not Found
An error incurred while downloading: http://export.arxiv.org/pdf/170902432 2017.pdf .
HTTP Error 404: Not Found
An error incurred while downloading: http://export.arxiv.org/pdf/170100160 2016.pdf .
HTTP Error 404: Not Found
An error incurred while downloading: http://export.arxiv.org/pdf/14126980 2014.pdf .
HTTP Error 404: Not Found
An error incurred while downloading: http://export.arxiv.org/pdf/13126114.pdf .
HTTP Error 404: Not Found
An error incurred while downloading: http://export.arxiv.org/pdf/171208708 2017.pdf .
HTTP Error 404: Not Found
An error incurred while downloading: http://export.arxiv.org/pdf/181208373 2018.pdf .
HTTP Error 404: Not Found
An error incurred while downloading: http://export.arxiv.org/pdf/180409269.pdf .
HTTP Error 404: Not Found
An error incurred while downloading: http://export.arxiv.org/pdf/190604029 2019.pdf .
HTTP Error 404: Not Found
An e

An error incurred while downloading: http://export.arxiv.org/pdf/0701186.pdf .
HTTP Error 404: Not Found
1710.07970 already exists, skipping.
1811.02240 already exists, skipping.
1804.03328 already exists, skipping.
1909.00219 already exists, skipping.
1601.05504 already exists, skipping.
1904.10880 already exists, skipping.
1906.11250 already exists, skipping.
1912.11106 already exists, skipping.
2003.01722 already exists, skipping.
1911.09768 already exists, skipping.
1812.01544 already exists, skipping.
1706.05394 already exists, skipping.
1611.01838 already exists, skipping.
1806.00900 already exists, skipping.
1706.02677 already exists, skipping.
1512.03385 already exists, skipping.
1502.01852 already exists, skipping.
1802.05300 already exists, skipping.
1502.03167 already exists, skipping.
1712.05055 already exists, skipping.
1609.04836 already exists, skipping.
1802.06175 already exists, skipping.
1708.02862 already exists, skipping.
1712.09203 already exists, skipping.
1411.77

It seems like there was an error in converting 1011.3176.pdf. Please try again later. Exit status 256.
1909.00381 already exists, skipping.
1906.07432 already exists, skipping.
1806.04011 already exists, skipping.
1607.03452 already exists, skipping.
1705.04851 already exists, skipping.
1907.11476 already exists, skipping.
1905.12648 already exists, skipping.
1906.04870 already exists, skipping.
1511.03575 already exists, skipping.
1610.05492 already exists, skipping.
1608.06879 already exists, skipping.
1905.02637 already exists, skipping.
1910.05857 already exists, skipping.
1712.00232 already exists, skipping.
1909.11774 already exists, skipping.
1903.07266 already exists, skipping.
1903.05012 already exists, skipping.
1702.08734 already exists, skipping.
1609.07228 already exists, skipping.
1808.09239 already exists, skipping.
1906.00658 already exists, skipping.
1807.00299 already exists, skipping.
1203.2288 already exists, skipping.
1908.08269 already exists, skipping.
1611.03423

An error incurred while downloading: http://export.arxiv.org/pdf/0000.0000.pdf .
HTTP Error 404: Not Found
2002.10071 already exists, skipping.
1505.07072 already exists, skipping.
1905.13285 already exists, skipping.
1805.01648 already exists, skipping.
1906.08530 already exists, skipping.
1903.08111 already exists, skipping.
1204.3071 already exists, skipping.
1904.08219 already exists, skipping.
1910.10125 already exists, skipping.
1708.04215 already exists, skipping.
1906.05316 already exists, skipping.
1812.05119 already exists, skipping.
1901.06116 already exists, skipping.
1509.03025 already exists, skipping.
1902.07698 already exists, skipping.
1712.09941 already exists, skipping.
1711.10467 already exists, skipping.
1508.01922 already exists, skipping.
1708.03288 already exists, skipping.
1605.07051 already exists, skipping.
2002.03937 already exists, skipping.
1111.0444 already exists, skipping.
1506.08056 already exists, skipping.
0212095 already exists, skipping.
1404.1578 

#### The strategy to evaluate the model is simple. We create two vectorised datasets, one with the citing articles and one with the cited articles. We then run a cosine similarity search from each citing article over the cited articles and count how many of its actual citations appear among the first N results. To do this, we first need to vectorize the citing and the cited articles.

In [14]:
#Not all the articles in citations_db were downloaded, either because the citation search failed or because
#the link was not valid, so we build a new dictionary that keeps only the citations whose text file exists in Config.txt_db
d={}
for cited, citations in citations_db.items():
    txt_labels=[]
    for citation in citations:
        idx,dummy=get_id_url(citation)
        article_path=os.path.join(Config.txt_db,idx+'.txt')
        if os.path.isfile(article_path):
            txt_labels.append(idx)
    if len(txt_labels)>0:
        d[cited]=txt_labels

In [15]:
# Get the set of all cited articles
txt_cited=[]
for citations in d.values():
    txt_cited=txt_cited+citations
txt_cited=set(txt_cited)

In [16]:
# Vectorize the cited articles
corpus=read_clean(txt_cited)
Xctd=tfidf.transform(corpus)

In [17]:
cited_vectorized={'Xctd':Xctd,'articles':np.array(list(txt_cited))}

In [18]:
# Save the vectorized form of the cited articles
with open('cited_vectorized','wb') as file:
    pickle.dump(cited_vectorized,file)

In [19]:
# Vectorize the citing articles 
corpus=read_clean(d.keys())
Xctn=tfidf.transform(corpus)

In [20]:
citing_vectorized={'Xctn':Xctn,'articles':np.array(list(d.keys()))}

In [21]:
#Save the vectorized form of the citing articles
with open('citing_vectorized','wb') as file:
    pickle.dump(citing_vectorized,file)

#### We can now write a function that performs the evaluation. This function computes the cosine similarity between the citing articles and the cited articles and, for each citing article, finds how many of its cited articles are returned among the first how_many results. Since the number of cited articles differs from one citing article to another, the final result is the average, over all citing articles, of the fraction of correctly guessed papers.

In [22]:
#Function that evaluates the model: average fraction of each citing article's citations found in the first how_many results
def evaluation_reccomender(cited_vectorized,citing_vectorized,d,how_many):
    Xctd=cited_vectorized['Xctd']
    articles_ctd=cited_vectorized['articles']
    Xctn=citing_vectorized['Xctn']
    articles_ctn=citing_vectorized['articles']
    #Indices of the how_many cited articles closest to each citing article
    similar_articles=find_similar(Xctd,Xctn,how_many)
    total_rate=0
    for i,similar in enumerate(similar_articles):
        #Citations of the i-th citing article that appear among its closest articles
        found=set(articles_ctd[similar]).intersection(set(d[articles_ctn[i]]))
        total_rate=total_rate+len(found)/len(d[articles_ctn[i]])
    return total_rate/len(articles_ctn)
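
#### To check by hand what this metric measures, the sketch below recomputes it on a tiny made-up example. It is not the code used above: the function name mean_recall_at_n, the dense NumPy arrays and the toy identifiers are all assumptions made for illustration, and the rows are assumed to be L2-normalised so that the dot product equals the cosine similarity.

```python
import numpy as np

def mean_recall_at_n(X_cited, X_citing, cited_ids, citing_ids, citations, n):
    """Average, over citing articles, of the fraction of their citations found in the top n."""
    # Rows are assumed to be L2-normalised, so the dot product is the cosine similarity.
    similarity = X_citing @ X_cited.T
    # For each citing article, indices of the n most similar cited articles.
    top_n = np.argsort(-similarity, axis=1)[:, :n]
    rates = []
    for row, citing in zip(top_n, citing_ids):
        retrieved = set(cited_ids[row])
        relevant = set(citations[citing])
        rates.append(len(retrieved & relevant) / len(relevant))
    return float(np.mean(rates))

# Hypothetical toy data: three cited articles and two citing articles in a 2D feature space.
cited_ids = np.array(['a', 'b', 'c'])
X_cited = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
citing_ids = ['p', 'q']
X_citing = np.array([[1.0, 0.0], [0.0, 1.0]])
citations = {'p': ['a', 'b'], 'q': ['c']}

for n in (1, 2, 3):
    print(n, mean_recall_at_n(X_cited, X_citing, cited_ids, citing_ids, citations, n))
```

#### With a window of 1, article p recovers one of its two citations and q recovers none, so the average is 0.25. The value grows with the window and reaches 1 once the window covers all the cited articles.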

In [24]:
how_many=50
print("The ratio of correctly guessed papers over a window of {} articles is {:.3f}".format(how_many,evaluation_reccomender(cited_vectorized,citing_vectorized,d,how_many)))

The ratio of correctly guessed papers over a window of 50 articles is 0.006


#### As can be seen, on average only about 0.6% of an article's citations appear among its 50 most similar articles. This result is definitely not exciting, but we need to consider that only a fraction of the cited papers can be found on the arXiv. It would be interesting to see what results we would get with more available resources. A possible improvement of the recommender system would be an approach similar to collaborative filtering, but this would also require access to more resources.
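
#### One way to put a score like this in context is to compare it with a ranking chosen at random. If how_many articles are drawn uniformly from a pool of M cited articles, the expected fraction of an article's citations that is recovered is how_many/M. The sketch below checks this by simulation. The pool size, the number of citations and the function name random_baseline_recall are hypothetical; for the data above the pool size would be len(txt_cited), which is not reported here.

```python
import random

def random_baseline_recall(pool_size, n_relevant, window, trials=2000, seed=0):
    """Average fraction of relevant items recovered when `window` items are drawn at random."""
    rng = random.Random(seed)
    # Which items are relevant does not matter for a random ranking, so take the first ones.
    relevant = set(range(n_relevant))
    total = 0.0
    for _ in range(trials):
        retrieved = rng.sample(range(pool_size), window)
        total += len(relevant.intersection(retrieved)) / n_relevant
    return total / trials

# Hypothetical sizes, chosen only for illustration.
pool_size, n_relevant, window = 1000, 10, 50
expected = window / pool_size
simulated = random_baseline_recall(pool_size, n_relevant, window)
print("expected: {:.3f}, simulated: {:.3f}".format(expected, simulated))
```

#### Comparing the measured value with how_many/len(txt_cited) would show how far the cosine similarity search is from a random guess.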