# Keyword analysis of STA papers Part 1

**In this notebook:**

- I query arXiv for the papers I am interested in.
- I download their source files.
- I extract the .tex files.
- I count the words in each .tex file.
    
**The output of the notebook is:**
- The results of the query, stored in a ```JSON``` file: ```query_results.json```.
- The arXiv IDs of the papers in ```query_results.json``` that don't have a ```.tex``` file, stored in a text file: ```papers_withoutTeX.txt```
- The arXiv IDs of the papers in ```query_results.json``` that have a ```.tex``` file, stored in a text file: ```papers_withTeX.txt```
- The arXiv IDs of the papers in ```papers_withTeX.txt``` that could be analyzed, stored in a text file: ```analyzed_papers.txt```
- The results of the word count analysis of the ```.tex``` files, stored in a ```JSON``` file: ```keywords.json```

## Import statements

**Built-in modules:**

In [None]:
import re            #regular expressions
import json          #JSON file utilities
import os            #operating system utilities
import sys           #system utilities
import tarfile       #read tar archives

**Third-party modules:**

In [None]:
import arxiv   #arXiv wrapper
import texcounter as TeX #my functions to work with .tex files

## Get the data set from the arXiv API

Use the arXiv API wrapper to query arXiv for papers belonging to the categories *quant-ph* (quantum physics) and *cond-mat* (condensed matter).

**arXiv categories:**

In [None]:
categories = [
    'quant-ph',
    'cond-mat'
] 

**List of query keywords:**

In [None]:
query_keywords = [
    'shortcuts'
    ,'counterdiabatic'
    ,'transitionless'
]

**Max number of results for each query:**

In [None]:
max_results = 5

**Container of the results:**

In [None]:
papers_list = []
results = 0

**Make a query for each category and query keyword:**

In [None]:
for cat in categories:
    for key_word in query_keywords:
        results = len(papers_list)
        query_string = 'all:{} AND cat:{}'.format(key_word,cat)
        papers_list.extend(
            arxiv.query(
                        query=query_string,
                        sort_by='submittedDate',
                        max_results=max_results
                       )
        )
        print("'{}' returned {} results.\n".format(query_string,len(papers_list)-results))
    
results = len(papers_list)  
print('\n\n*** Returned',results,'results in total **\n\n')
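To check which queries the nested loops will send before hitting the API, the query strings can be built on their own. This is a small offline sketch that repeats the lists from the cells above; it makes no network call.

```python
from itertools import product

categories = ['quant-ph', 'cond-mat']
query_keywords = ['shortcuts', 'counterdiabatic', 'transitionless']

# One query string per (category, keyword) pair, in the same order
# as the nested loops: categories outside, keywords inside.
query_strings = [
    'all:{} AND cat:{}'.format(key_word, cat)
    for cat, key_word in product(categories, query_keywords)
]

for query_string in query_strings:
    print(query_string)
```

With two categories and three keywords this gives six queries, so at most `6 * max_results` results before duplicates are removed.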

**To avoid duplicates create a dictionary where the keys are arXiv IDs and the content is the query result:**

In [None]:
papers_dict = dict()

for paper in papers_list:
    ID = paper['id'].split('/')[-1]
    papers_dict[ID] = papers_dict.get(ID,paper)

    
print('Number of papers without duplicates:',len(papers_dict))
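A toy run of the same idea, assuming (as the `split('/')` above does) that the `id` field is a URL ending in the arXiv ID. The sample entries and IDs are made up for illustration, and `setdefault` is used here as an equivalent of the `get` pattern in the cell above.

```python
# Made-up stand-ins for query results; only the 'id' field matters here.
sample_papers = [
    {'id': 'http://arxiv.org/abs/1234.56789v1', 'title': 'A'},
    {'id': 'http://arxiv.org/abs/1234.56789v1', 'title': 'A (duplicate)'},
    {'id': 'http://arxiv.org/abs/2345.67890v2', 'title': 'B'},
]

sample_dict = {}
for paper in sample_papers:
    ID = paper['id'].split('/')[-1]      # last URL segment is the arXiv ID
    sample_dict.setdefault(ID, paper)    # keep the first result seen for each ID

print('Number of papers without duplicates:', len(sample_dict))
```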


**Some papers may have more than one version, keep only the newest version (TO DO):**

In [None]:
# # First, sort the dictionary IDs
# sorted_IDs = sorted(papers_dict.keys())
# len(sorted_IDs)

In [None]:
# # list of sorted IDs without version number
# IDs_no_version = list(dict.fromkeys([ID[:ID.find('v')] for ID in sorted_IDs]))
# len(IDs_no_version)
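One possible way to finish this TO DO, as a sketch only. It assumes every versioned ID ends in `v` followed by the version number. `keep_newest_versions` is a hypothetical helper, not part of the notebook, and the sample IDs are made up.

```python
import re

def keep_newest_versions(ids):
    """Map each base arXiv ID to the full ID with the highest version number."""
    newest = {}
    for full_id in ids:
        match = re.fullmatch(r'(.+)v(\d+)', full_id)
        if match is None:
            # No version suffix found: treat the whole ID as version 0.
            base, version = full_id, 0
        else:
            base, version = match.group(1), int(match.group(2))
        # Compare versions as integers so that v10 beats v2.
        if base not in newest or version > newest[base][0]:
            newest[base] = (version, full_id)
    return {base: full_id for base, (version, full_id) in newest.items()}

sample_ids = ['1234.56789v1', '1234.56789v3', '1234.56789v2', '2345.67890v1']
print(keep_newest_versions(sample_ids))
```

The values of the returned dictionary could then be used to filter `papers_dict`. Comparing the version as an integer matters because a plain string sort would put `v10` before `v2`.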

**Save the dictionary with the results of the query in a JSON file:**

In [None]:
json_file = json.dumps(papers_dict)
with open("query_results.json","w") as f:
    f.write(json_file)

## Make a directory for the source files of the papers and download them

In [None]:
source_folder = 'paper_source_files/'

In [None]:
%mkdir paper_source_files
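The `%mkdir` magic complains if the folder already exists, which happens when the notebook is re-run. A pure Python alternative is `os.makedirs` with `exist_ok=True`. The sketch below runs inside a temporary directory so that it does not touch the working folder; in the notebook the argument would simply be `source_folder`.

```python
import os
import tempfile

# Work inside a throwaway parent directory for the demonstration.
with tempfile.TemporaryDirectory() as parent:
    demo_folder = os.path.join(parent, 'paper_source_files')
    os.makedirs(demo_folder, exist_ok=True)
    os.makedirs(demo_folder, exist_ok=True)   # second call does nothing and raises no error
    created = os.path.isdir(demo_folder)

print('Folder was created:', created)
```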

**Use the arXiv API wrapper function `download` to download the papers' source tarballs ([I contributed to this feature!](https://github.com/lukasschwab/arxiv.py/graphs/contributors)):**

In [None]:
# # This one takes a while to run, be patient.
# # Leave this cell commented out if it is not going to be used.

# for paper in papers_dict:
#     arxiv.download(papers_dict[paper],dirpath=source_folder,prefer_source_tarfile=True)

**The file names are too long; keep only the arXiv IDs:**

In [None]:
for filename in os.listdir(source_folder):
    if filename.endswith('.tar.gz'):
        #raw string so that \w and \. are passed to the regex engine unchanged
        newname = re.sub(r'\w+\.tar\.gz','tar.gz',filename)
        os.rename(source_folder+filename,source_folder+newname)
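To see what the substitution does, here it is on a single made-up file name. The sketch assumes downloaded names have the form `<arXiv ID>.<title_with_underscores>.tar.gz`; the sample name is hypothetical.

```python
import re

pattern = re.compile(r'\w+\.tar\.gz')

# Hypothetical file name: arXiv ID, then the title, then the extension.
sample_name = '1234.56789v1.Some_paper_title.tar.gz'

# The match is the last run of word characters before '.tar.gz' (the title),
# so only the title is dropped and the ID prefix is kept.
new_name = pattern.sub('tar.gz', sample_name)
print(new_name)

# Caveat: running the substitution on an already shortened name eats
# part of the ID, so the renaming cell should be run only once.
renamed_twice = pattern.sub('tar.gz', new_name)
print(renamed_twice)
```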

## Make a directory for the .tex files

In [None]:
TeX_folder = 'paper_TeX_files/'

In [None]:
%mkdir paper_TeX_files

## Get the TeX files 

**Function that returns the first member of the tarball with a .tex extension, if one exists:**

In [None]:
def returnTeXFileMember(tar_file):
    for member in tar_file.getmembers():
        if member.isfile() and member.name.lower().endswith('.tex'):
            return member
    return None
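A quick way to try the function without downloading anything is to build small tarballs in memory. In this sketch `make_tarball` is a hypothetical helper written only for the test, and the file names and contents are made up.

```python
import io
import tarfile

def returnTeXFileMember(tar_file):
    # Same function as in the cell above, repeated so the sketch runs on its own.
    for member in tar_file.getmembers():
        if member.isfile() and member.name.lower().endswith('.tex'):
            return member
    return None

def make_tarball(files):
    """Build an in-memory .tar.gz from a {name: text} mapping (test helper)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode='w:gz') as tar:
        for name, text in files.items():
            data = text.encode('utf-8')
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer

# A tarball with a .tex file in a subfolder and with an upper-case extension.
with_tex = make_tarball({'fig1.eps': '%!PS', 'src/Main.TeX': r'\documentclass{article}'})
with tarfile.open(fileobj=with_tex, mode='r') as tar:
    found = returnTeXFileMember(tar)

# A tarball without any .tex file.
without_tex = make_tarball({'fig1.eps': '%!PS'})
with tarfile.open(fileobj=without_tex, mode='r') as tar:
    missing = returnTeXFileMember(tar)

print(found.name if found else None, missing)
```

Note that the function returns the first `.tex` member it meets. For submissions with several `.tex` files this is not necessarily the main document.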

**Loop over the files and extract them into ```paper_TeX_files/```**

Some of the tarballs do not have a ```.tex``` file. Keep a list of those that do not.

In [None]:
papers_withoutTeX = []
papers_withTeX = []

for filename in os.listdir(source_folder):
    #skip anything that is not a tarball (e.g. hidden files)
    if not filename.endswith('.tar.gz'):
        continue
    with tarfile.open(source_folder+filename,'r') as tar:
        TeXFileMember = returnTeXFileMember(tar)
        if TeXFileMember:
            papers_withTeX.append(filename)
            tar.extract(TeXFileMember,path=TeX_folder)
            #rename the extracted file to <arXiv ID>.tex
            os.rename(TeX_folder+TeXFileMember.name,TeX_folder+filename.replace('.tar.gz','.tex'))
        else:
            papers_withoutTeX.append(filename)

**Remove the empty folders left behind by the extraction:**

In [None]:
for el in os.listdir(TeX_folder):
    if os.path.isdir(TeX_folder+el):
        os.removedirs(TeX_folder+el)
        print('removed:',el)

**Adapt the names in the lists so that each paper is identified only by its arXiv ID:**

In [None]:
papers_withoutTeX = [ paper.replace('.tar.gz','') for paper in papers_withoutTeX ]
papers_withTeX = [ paper.replace('.tar.gz','') for paper in papers_withTeX ]

**Keep a record of the papers that have a .tex file and the ones that don't:**

In [None]:
with open("papers_withTeX.txt",'w') as f:
    f.write('\n'.join( papers_withTeX ))

with open("papers_withoutTeX.txt",'w') as f:
    f.write('\n'.join( papers_withoutTeX ))

## Analyze the papers

### 1. Use a dictionary to put all the keywords and their counts

In [None]:
keywords_count = dict()

### 2. Use a dictionary that relates every word to a list of the papers in which it appears

In [None]:
keywords_papers = dict()

### 3. Analyze the papers

In [None]:
failed = list()            #list of failed files
succeeded = list()         #list of analyzed files

for TeX_file in os.listdir(TeX_folder):
    #make sure only .tex files are treated: there are hidden files in the
    #folder with unwanted extensions
    if TeX_file.endswith('.tex'):
        paperID = TeX_file.replace('.tex','')
        try:
            paper_words = TeX.analyzeTeXFile(TeX_folder+TeX_file,keywords_count)
            for word in paper_words:
                keywords_papers[word] = keywords_papers.get(word,list())
                keywords_papers[word].append((paperID,paper_words[word]))
            succeeded.append( paperID )
        except Exception:   #avoid a bare except, which would also swallow KeyboardInterrupt
            failed.append( paperID )
            #os.remove(TeX_folder+TeX_file) #A bit of a nasty hack
          
print('Failed: ',len(failed))
print('Successful: ',len(succeeded))
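The internals of `texcounter` are not shown in this notebook. To make the data flow of the loop concrete, here is a hypothetical stand-in for `TeX.analyzeTeXFile` that works on a string instead of a file. It assumes, as the call above suggests, that the function returns the per-paper counts and updates the global dictionary in place. The cleaning rules (drop comments, drop command names, keep words of three or more letters) are my own simplification, not the real module's behavior.

```python
import re
from collections import Counter

def count_words_in_tex(tex_source, global_counts):
    """Hypothetical stand-in for TeX.analyzeTeXFile, operating on a string."""
    text = re.sub(r'(?<!\\)%.*', '', tex_source)     # drop comments (unescaped % to end of line)
    text = re.sub(r'\\[A-Za-z]+\*?', ' ', text)      # drop command names such as \section
    words = re.findall(r'[a-z]{3,}', text.lower())   # keep words of 3+ letters
    paper_counts = Counter(words)
    # Accumulate this paper's counts into the global dictionary.
    for word, n in paper_counts.items():
        global_counts[word] = global_counts.get(word, 0) + n
    return dict(paper_counts)

sample = r"""\section{Shortcuts} % a comment with adiabatic
Counterdiabatic driving gives shortcuts to adiabaticity."""

global_counts = {}
paper_counts = count_words_in_tex(sample, global_counts)
print(paper_counts)
```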

### 4. Save the list of the papers that were analyzed

In [None]:
with open('analyzed_papers.txt','w') as f_out:
    f_out.write('\n'.join(succeeded))

### 5. Create a dictionary keeping track of the keywords: their count, the papers in which they appear (and how many times in the paper)

```python
keywords['word'] = {
    'count':count,
    'papers_count': {'paperID':countInPaper,...,}
}
```


In [None]:
#build the dictionary, ordered by descending count and then alphabetically
keywords = {
    word:{
        'count':keywords_count.get(word,0),
        'papers_count':dict( keywords_papers.get(word,list())  )
    }
    for word in sorted(
        keywords_count,key = lambda word : (-keywords_count[word],word)
    )
}
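The same comprehension on toy data shows the resulting order and structure. The words, paper IDs and counts below are made up for illustration.

```python
# Made-up global counts and per-paper counts.
keywords_count = {'driving': 2, 'shortcuts': 5, 'adiabatic': 2}
keywords_papers = {
    'shortcuts': [('paperA', 3), ('paperB', 2)],
    'driving': [('paperA', 2)],
    'adiabatic': [('paperB', 2)],
}

keywords = {
    word: {
        'count': keywords_count.get(word, 0),
        'papers_count': dict(keywords_papers.get(word, list()))
    }
    # Sort key: negative count first (descending), then the word (alphabetical).
    for word in sorted(keywords_count, key=lambda word: (-keywords_count[word], word))
}

for word, entry in keywords.items():
    print(word, entry)
```

Ties in the count are broken alphabetically, so `adiabatic` comes before `driving`. The order survives because dictionaries keep insertion order in Python 3.7 and later.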

### 6. Save the dictionary in a JSON file

In [None]:
with open('keywords.json','w') as f:
    json_str = json.dumps(keywords)
    f.write(json_str)
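A later notebook would presumably read the file back. The sketch below does a round trip on toy data in a temporary folder, to show that the frequency ordering is still there after loading. The dictionary contents are made up.

```python
import json
import os
import tempfile

# Made-up data with the same structure as `keywords`.
toy_keywords = {
    'shortcuts': {'count': 5, 'papers_count': {'paperA': 3, 'paperB': 2}},
    'adiabatic': {'count': 2, 'papers_count': {'paperB': 2}},
    'driving': {'count': 2, 'papers_count': {'paperA': 2}},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'keywords.json')
    with open(path, 'w') as f:
        json.dump(toy_keywords, f)       # write straight to the file object
    with open(path) as f:
        loaded = json.load(f)

# Key order is preserved, so the first keys are the most frequent words.
top_two = list(loaded)[:2]
print(top_two)
```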

## Don't forget to delete the .tex and source folders!

In [None]:
%%bash

rm -r paper_source_files
rm -r paper_TeX_files
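If the `%%bash` magic is not available (on Windows, for example), `shutil.rmtree` does the same job from Python. This sketch creates and deletes the two folders inside a temporary directory so that it is safe to run; in the notebook the paths would be `source_folder` and `TeX_folder`.

```python
import os
import shutil
import tempfile

# Throwaway parent directory for the demonstration.
parent = tempfile.mkdtemp()
folder_names = ('paper_source_files', 'paper_TeX_files')

for name in folder_names:
    os.makedirs(os.path.join(parent, name))

# Delete each folder and everything inside it;
# ignore_errors=True means a missing folder is not a problem.
for name in folder_names:
    shutil.rmtree(os.path.join(parent, name), ignore_errors=True)

remaining = os.listdir(parent)
shutil.rmtree(parent)   # clean up the throwaway parent as well
print('Left over:', remaining)
```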