# Keyword analysis of STA papers, Part 1

In this notebook I:

- query the arXiv for the papers,
- download their source files,
- extract the .tex files,
- count the words in each of the .tex files.

I also keep a record of:
- The results of the query, in ```query_results.json```.
- The arXiv IDs of the papers in ```query_results.json``` that have a ```.tex``` file, in ```papers_withTeX.txt```.
- The arXiv IDs of the papers in ```query_results.json``` that don't have a ```.tex``` file, in ```papers_withoutTeX.txt```.
- The raw word count data of the analyzed .tex files, in ```raw_word_count_data.json```.
- The arXiv IDs of the papers that were analyzed successfully, in ```analyzed_papers.txt```.
- The list of papers in which each word appears, in ```keywords_papers.json```.

In [1]:
import matplotlib.pyplot as plt # Plots
import re #regular expressions
import tarfile #open tarfiles
import arxiv   #arXiv wrapper
import numpy as np #numeric tools
import json

In [2]:
import os #operating system utilities
import sys #system
import texcounter as TeX #my functions to work with .tex files

## Get the data set from the arXiv API

Use the arXiv API wrapper to query the arXiv for papers in the category *quant-ph* (quantum physics), one query per keyword.

### List of query keywords

In [4]:
query_keywords = [
    'shortcuts'
    ,'counterdiabatic'
    ,'transitionless'
]

### Container of the results

In [5]:
papers_list = []
results = 0

### Make a query for each query keyword

In [6]:
for key_word in query_keywords:
    results = len(papers_list)
    query_string = 'all:{} AND cat:quant-ph'.format(key_word)
    papers_list.extend(
        arxiv.query(
                    query=query_string,
                    sort_by='submittedDate',
                    max_results=10
                   )
    )
    print('- Query of',"'{}'".format(key_word),'returned',len(papers_list)-results,'results.\n')
    
results = len(papers_list)  
print('\n\n*** Returned',results,'results in total **\n\n')

- Query of 'shortcuts' returned 10 results.

- Query of 'counterdiabatic' returned 10 results.

- Query of 'transitionless' returned 10 results.



*** Returned 30 results in total **




### To avoid duplicates, create a dictionary whose keys are arXiv IDs and whose values are the query results

In [7]:
papers_dict = dict()

for paper in papers_list:
    ID = paper['id'].split('/')[-1]
    papers_dict[ID] = paper

    
print('Number of papers without duplicates:',len(papers_dict))


Number of papers without duplicates: 28


### Some papers may have more than one version; keep only the newest one (TO DO)

In [8]:
# # First, sort the dictionary IDs
# sorted_IDs = sorted(papers_dict.keys())
# len(sorted_IDs)

In [9]:
# # List of sorted IDs without the version number (duplicates removed, order kept)
# IDs_no_version = list(dict.fromkeys([ID.split('v')[0] for ID in sorted_IDs]))
# len(IDs_no_version)
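
A minimal sketch of how this TO DO could be done, assuming that the dictionary keys look like `1801.00001v2` (an ID followed by `v` and a version number). The function name `keep_latest_versions` is hypothetical, and note that the returned dictionary is keyed by the ID *without* the version suffix, which is a design choice of this sketch and not something the rest of the notebook relies on.

```python
import re

def keep_latest_versions(papers):
    """Return a dict keyed by version-less arXiv ID, keeping only the highest version.

    Assumes keys of the form '<id>v<number>'; keys without a version
    suffix are treated as version 0.
    """
    latest = {}
    for versioned_id, paper in papers.items():
        match = re.fullmatch(r'(.+?)v(\d+)', versioned_id)
        if match:
            base_id, version = match.group(1), int(match.group(2))
        else:
            base_id, version = versioned_id, 0
        # keep this entry only if it is newer than the one stored so far
        if base_id not in latest or version > latest[base_id][0]:
            latest[base_id] = (version, paper)
    return {base_id: paper for base_id, (version, paper) in latest.items()}
```

Comparing the versions as integers matters here: sorting the IDs as strings would put `v10` before `v2`.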

### Save the dictionary with the results in a file

In [10]:
# .json file
json_file = json.dumps(papers_dict)
with open("query_results.json","w") as f:
    f.write(json_file)

#### Optionally, show all the results

In [11]:
# for paper in papers_dict.values():
#     print('DATE:',paper.get('published','N.A.'))
#     print('TITLE:',paper.get('title','UNTITLED'),'\n\n') 

## Make a directory for the source files of the papers and download them

In [12]:
source_folder = 'paper_source_files/'

In [13]:
%mkdir paper_source_files

Use the arXiv API wrapper function `download` to download the source tarballs of the papers ([I contributed to this feature!](https://github.com/lukasschwab/arxiv.py/graphs/contributors))

In [14]:
# # This one takes a while to run, be patient

# for paper in papers_dict:
#     arxiv.download(papers_dict[paper],dirpath=source_folder,prefer_source_tarfile=True)

### The names are too long, keep only the arXiv IDs

In [15]:
for filename in os.listdir(source_folder):
    if filename.endswith('.tar.gz'):
        # drop the last dot-separated chunk before '.tar.gz' (the title)
        newname = re.sub(r'\w+\.tar\.gz$','tar.gz',filename)
        os.rename(source_folder+filename,source_folder+newname)
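
To see what the substitution above does, here is a small check on a made-up file name. The name `1905.01234v1.Some_paper_title.tar.gz` is hypothetical; the exact naming scheme of the downloaded tarballs is an assumption of this sketch.

```python
import re

# Hypothetical file name of the form '<arXiv ID>.<title>.tar.gz'
example = '1905.01234v1.Some_paper_title.tar.gz'

# Same substitution as in the renaming loop: the word-like chunk right
# before '.tar.gz' is removed, the rest of the name is kept.
short = re.sub(r'\w+\.tar\.gz$', 'tar.gz', example)
print(short)  # 1905.01234v1.tar.gz

# Applying it a second time eats part of the ID.
print(re.sub(r'\w+\.tar\.gz$', 'tar.gz', short))  # 1905.tar.gz
```

The second call shows that the substitution is not idempotent, so the renaming cell should be run only once per download.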

## Make a directory for the .tex files

In [16]:
TeX_folder = 'paper_TeX_files/'

In [17]:
%mkdir paper_TeX_files

## Get the TeX files 

### Function that returns the first member of the tarball with a .tex extension, if there is one

In [18]:
def returnTeXFileMember(tar_file):
    for member in tar_file.getmembers():
        if member.isfile() and member.name.lower().endswith('.tex'):
            return member
    return None
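
`returnTeXFileMember` returns the first `.tex` member it finds, which may not be the main file when a tarball contains several. A possible refinement, sketched below under the assumption that the main file is the one containing `\documentclass`, falls back to the first `.tex` member otherwise. The name `find_main_tex_member` is hypothetical and this variant is not used in the rest of the notebook.

```python
import io
import tarfile

def find_main_tex_member(tar_file):
    """Return the first .tex member that contains a documentclass command.

    Falls back to the first .tex member if none contains one, and
    returns None if the tarball has no .tex member at all.
    """
    first_tex = None
    for member in tar_file.getmembers():
        if member.isfile() and member.name.lower().endswith('.tex'):
            if first_tex is None:
                first_tex = member
            # read the raw bytes so that no text encoding has to be guessed
            content = tar_file.extractfile(member).read()
            if b'\\documentclass' in content:
                return member
    return first_tex
```

Reading every candidate file is slower than returning the first match, but the tarballs here are small.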

## Loop over the tarballs and extract the .tex files into ```paper_TeX_files/```

Some of the tarballs do not have a ```.tex``` file. Keep a list of the ones that do and a list of the ones that do not.

In [19]:
papers_withoutTeX = []
papers_withTeX = []

for filename in os.listdir(source_folder):
    with tarfile.open(source_folder+filename,'r') as tar:
        TeXFileMember = returnTeXFileMember(tar)
        if TeXFileMember:
            papers_withTeX.append(filename)
            tar.extract(TeXFileMember,path=TeX_folder)
            #rename the extracted file to <arXiv ID>.tex
            os.rename(TeX_folder+TeXFileMember.name,TeX_folder+filename.replace('.tar.gz','.tex'))
        else:
            papers_withoutTeX.append(filename)

Remove the empty folders left behind by the extraction

In [20]:
for el in os.listdir(TeX_folder):
    if os.path.isdir(TeX_folder+el):
        os.removedirs(TeX_folder+el)
        print('removed:',el)

### Adapt the names in the lists so that they are just the arXiv IDs

In [21]:
papers_withoutTeX = [ paper.replace('.tar.gz','') for paper in papers_withoutTeX ]
papers_withTeX = [ paper.replace('.tar.gz','') for paper in papers_withTeX ]

### Keep a record of the papers that have a .tex file and the ones that don't

In [22]:
with open("papers_withTeX.txt",'w') as f:
    f.write('\n'.join( papers_withTeX ))

with open("papers_withoutTeX.txt",'w') as f:
    f.write('\n'.join( papers_withoutTeX ))

## Analyze the papers

### 1. Use a dictionary to store all the keywords and their counts

In [23]:
keywords_count = dict()

### 2. Use a dictionary that maps every word to the list of papers in which it appears

In [24]:
keywords_papers = dict()

### 3. Analyze all the .tex files

In [25]:
failed = []      #IDs of the papers whose .tex file could not be analyzed
succeeded = []   #IDs of the papers that were analyzed

for TeX_file in os.listdir(TeX_folder):
    #make sure only .tex files are treated: there are hidden files in the
    #folder with unwanted extensions
    if TeX_file.endswith('.tex'):
        paperID = TeX_file.replace('.tex','')
        try:
            paper_words = TeX.analyzeTeXFile(TeX_folder+TeX_file,keywords_count)
            for word in paper_words:
                keywords_papers.setdefault(word,[]).append(paperID)
            succeeded.append( paperID )
        except Exception:
            #any error while parsing counts as a failure for this paper
            failed.append( paperID )
            #os.remove(TeX_folder+TeX_file) #A bit of a nasty hack
          
print('Failed: ',len(failed))
print('Successful: ',len(succeeded))

Failed:  2
Successful:  26
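
The `texcounter` module is my own and is not shown in this notebook. Purely to illustrate the interface the loop above relies on (a function that updates a shared dictionary of counts and returns the words found in one paper), here is a rough, hypothetical stand-in. The name `count_words_in_tex` and the cleaning rules are assumptions made for this sketch, not the actual code of `TeX.analyzeTeXFile`.

```python
import re
from collections import Counter

def count_words_in_tex(tex_source, global_counts):
    """Add the word counts of tex_source to global_counts; return the set of words found.

    Very rough cleaning: comments, inline math and command names are
    dropped, and everything else is split into lowercase words.
    """
    text = re.sub(r'(?<!\\)%.*', '', tex_source)   # drop comments (but not \%)
    text = re.sub(r'\$[^$]*\$', ' ', text)          # drop inline math
    text = re.sub(r'\\[a-zA-Z]+\*?', ' ', text)     # drop command names
    counts = Counter(re.findall(r'[a-z]+', text.lower()))
    for word, n in counts.items():
        global_counts[word] = global_counts.get(word, 0) + n
    return set(counts)

counts = {}
words = count_words_in_tex(r'\section{Shortcuts} Shortcuts to adiabaticity $x^2$ % note', counts)
print(counts)  # {'shortcuts': 2, 'to': 1, 'adiabaticity': 1}
```

Returning a set means that each paper is appended at most once to the list of a given word, which is what the loop above needs.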


### 4. Put the contents of the keywords dictionary in a .json file

In [29]:
json_str = json.dumps(keywords_count)
with open('raw_word_count_data.json','w') as f_out:
    f_out.write(json_str)

### 5. Save the list of the papers that were analyzed

In [30]:
with open('analyzed_papers.txt','w') as f_out:
    f_out.write('\n'.join(succeeded))

### 6. Save the dictionary keeping track of the papers in which the keywords appear

In [31]:
json_str = json.dumps(keywords_papers)
with open("keywords_papers.json","w") as f:
    f.write(json_str)
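
The saved JSON files can be read back later with `json.loads`. As a small sketch of that round trip, the helper below (a hypothetical name, `most_common_keywords`) takes the text of a word count file such as `raw_word_count_data.json` and returns the most frequent words.

```python
import json
from collections import Counter

def most_common_keywords(json_text, n=10):
    """Return the n most frequent (word, count) pairs from a JSON word count."""
    counts = Counter(json.loads(json_text))
    return counts.most_common(n)

# Toy data standing in for the contents of the word count file
print(most_common_keywords('{"a": 3, "b": 5, "c": 1}', n=2))  # [('b', 5), ('a', 3)]
```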

## Don't forget to delete the .tex and source folders!

In [32]:
%%bash

rm -r paper_source_files
rm -r paper_TeX_files