# Preparation
### Install prerequisites
1. Install Python 3.9 (Windows app store)
2. Install Git (https://git-scm.com/downloads)
3. Install Visual C++ build tools 14.0 (use the Visual Studio installer from https://visualstudio.microsoft.com/visual-cpp-build-tools/; under individual components, select VS 2015 C++ build tools (v14.00))

### Install aparts package and dependencies

In [2]:
# The leading "!" runs each line as a shell command from the notebook; omit it when running in a terminal
!python3.9 -m pip install git+https://github.com/SamBoerlijst/aparts.git
!python3.9 -m pip install notebook

#### In the meantime, download a relevant article list or continue with the example csv
The package uses a csv containing titles and abstracts (and optionally author-given keywords) to generate a keyword list with which to scan pdf files. Such a dataset may be acquired using the following methods. To increase relevance, one could use the [[#Query expansion]] function to propose an optimized query based on tag co-occurrence.
* **Web of Science**
	1. Go to [Web of Science](https://www-webofscience-com.ezproxy.leidenuniv.nl/wos/) and log in with your institution.
	2. Enter a query broad enough to capture the scope of the bibliography you want to tag later on and press search.
	3. Click on ‘Export’ and select ‘Excel’.
	4. Select the number of records you want to import, and ‘Full record’ under Record content. About 200 articles are recommended, but the only true limitation is computation time during keyword generation.
	5. Convert the xlsx to csv using Excel or equivalent.
* **Publish or Perish**
	1. Download [Publish or Perish](https://harzing.com/resources/publish-or-perish).
	2. Open the software and select PubMed as search engine. Many search engines are available, but PubMed is one of the few that include abstracts in their records.
	3. Enter a query broad enough to capture the scope of the bibliography you want to tag later on and press search.
	4. Save the results as csv.
* **Semantic Scholar**
	1. Web of Science and Publish or Perish can retrieve elaborate datasets as described above. As an alternative, the following function downloads a simplified dataset from Semantic Scholar:

In [None]:
from aparts.src.semantic_scholar import query_to_csv

query_to_csv(query="Culex pipiens AND population dynamics", output = 'C:/Users/boerlijs/aparts/input/papers.csv')

N.B. The package expects a comma-delimited csv file with an "Article Title" and an "Abstract" column. You may either rename the columns manually or define the column names in the extract_tags function.
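
Before running the later steps, it can help to confirm that the csv header actually contains the expected columns. The sketch below uses only the standard library. `missing_columns` is a hypothetical helper for illustration and not part of aparts; its default column names are the two mentioned above.

```python
import csv


def missing_columns(csv_path, required=("Article Title", "Abstract"), delimiter=","):
    """Return the required column names that are absent from the csv header."""
    # "utf-8-sig" also strips the byte order mark that Excel may add to csv exports
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle, delimiter=delimiter), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Example (uncomment and point to your own file):
# print(missing_columns("papers.csv"))
```

An empty list means both columns were found. Any name that is returned needs to be renamed in the csv or passed to extract_tags as described above.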

### Install model for spaCy
spaCy uses a language model to detect common words and their respective types. To download the English language model, run the following command:

In [1]:
!python3.9 -m spacy download en_core_web_sm

### Install model for NLTK
NLTK, the Natural Language Toolkit, uses two tokenizer resources ("punkt" and "punkt_tab") to tokenize words, which will be used to algorithmically identify tags. To download the tokenizers, run the following commands.

In [None]:
!python3.9 -m nltk.downloader punkt
!python3.9 -m nltk.downloader punkt_tab

Or

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

# Setting variables
To make working with the package easier, define the following information about the input and output in the code block below:

In [1]:
#Define where the input and output folders should be stored
working_directory = "C:/Users/boerlijs/aparts"

#Define the query used for the input csv
query = "ecology AND vector OR mosquito* OR population dynamics OR Culex"
#Define the filename of the input csv, without its extension
input_csv = "savedrecs"
#Define the column names used in the input csv
title_column = "Article Title"
author_column = "Author"
abstract_column = "Abstract"
author_given_keywords_column = "Author Keywords"
published_date = "Publication Year"

#Define the filename of the original .bib file without its extension
bib_file = "Library"

#Define the folder in which the pdf files are stored. It does not matter if they are stored in subfolders.
pdf_folder = "C:/.../Zotero/storage"

#Define the filename in which to store the keywords
keyword_list = "keywords"


#Define output
##Define the name for the list of tagged files
tagged_files_csv = "tagged_files"

##Define the name for the merged output csv
output_csv = "savedrecs_tagged"
#Define which delimiter the output should use. ";" is advised because commas occur in author lists, titles and abstracts
separator = ";"


#Define the filename of a curated set of articles, otherwise leave equal to the tagged csv
curated_csv = output_csv



Derive the needed folder and file paths from the directory and names just set:

In [6]:
input_folder = f"{working_directory}/input"
output_folder = f"{working_directory}/output"
template_folder = f"{input_folder}/templates"

input_csv_path = f"{input_folder}/{input_csv}.csv"
bib_file_path = f"{input_folder}/{bib_file}.bib"
keyword_list_path = f"{input_folder}/{keyword_list}.csv"
curated_csv_path = f"{input_folder}/{curated_csv}.csv"

tagged_files_csv_path = f"{output_folder}/csv/{tagged_files_csv}.csv"
output_csv_path = f"{output_folder}/csv/{output_csv}.csv"
output_csv_path_deduplicated = f"{output_folder}/csv/{output_csv}_deduplicated.csv"
output_csv_path_clusters = f"{output_folder}/csv/{output_csv}_clusters.csv"
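
A missing input file tends to surface only several steps later, so a quick existence check may save time. The sketch below is an illustration using the standard library. `report_missing` is a hypothetical helper and not part of aparts.

```python
from pathlib import Path


def report_missing(paths):
    """Return the labels of the paths that do not exist yet."""
    return [label for label, path in paths.items() if not Path(path).exists()]


# Example with the variables defined above (uncomment to use):
# print(report_missing({"input csv": input_csv_path, "bib file": bib_file_path, "pdf folder": pdf_folder}))
```

Output paths are expected to be reported as missing until the corresponding step has been run.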

Share the working directory with the command line:

In [7]:
import os
os.environ['working_directory'] = working_directory

### Collect pdf files
To scan the full text for tags, the pdf files are needed. If you have not downloaded them already, there are two methods to find these files.

##### Open access papers
Download links for open access papers may be found, and exported to csv, using the following function:

In [None]:
from aparts.src.semantic_scholar import batch_collect_paper_metadata

batch_collect_paper_metadata(input=input_csv_path, output=f"{input_folder}/{input_csv}_metadata.csv")

##### Download specific articles using Sci-Hub
The following function tries to find an article by name using scholarly and then downloads it, storing it using the title as its name.

In [None]:
from aparts.src.download_pdf import scihub_download_pdf, get_article

paper = get_article(title="The unsuccessful self-treatment of a case of \"writer's block\"")
scihub_download_pdf(paper=paper, output_folder = f"{input_folder}/pdf")

Optional: one could loop over the title column of the article list to serve as input for the download function. However, it has not been tested whether Sci-Hub blocks such automated requests.

In [None]:
import pandas as pd

df = pd.read_csv(input_csv_path)
title_list = df[title_column]

for title in title_list:
    paper = get_article(title=title)
    scihub_download_pdf(paper=paper, output_folder = f"{input_folder}/pdf")
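
A more defensive variant of such a loop pauses between requests and keeps going when a single title fails. The sketch below shows one way to do that. `download_all` and its `delay_seconds` default are assumptions for illustration, not part of aparts, and whether a pause is enough to avoid being blocked is equally untested.

```python
import time


def download_all(titles, fetch, delay_seconds=5.0):
    """Call fetch(title) for each usable title, pausing between requests.

    Returns a list of (title, error message) pairs for the titles that failed.
    """
    failures = []
    attempted = 0
    for title in titles:
        if not isinstance(title, str) or not title.strip():
            continue  # skip empty or missing (NaN) titles
        if attempted:
            time.sleep(delay_seconds)  # pause before every request except the first
        attempted += 1
        try:
            fetch(title)
        except Exception as error:  # record the failure and continue with the next title
            failures.append((title, str(error)))
    return failures


# Example with the functions imported above (uncomment to use):
# failures = download_all(
#     title_list,
#     lambda title: scihub_download_pdf(paper=get_article(title=title), output_folder=f"{input_folder}/pdf"),
# )
```

The returned list can be inspected afterwards to retry or manually download the articles that failed.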

### Move to a working directory to store the input and output files
Using a standard folder structure makes it easier to supply input files and to store output files. To generate this structure and to see whether the APARTS package is correctly configured, run the following commands

In [8]:
cd $working_directory

C:\Users\boerlijs\aparts




To generate the folder structure that will be used to take and store files within the working directory, use: 

In [None]:
from aparts.src.construct_keylist import generate_folder_structure

generate_folder_structure()

- Store the article list (csv) under /input.
- If you want to add to an existing .bib file, add that to /input as well.
- Article pdfs will automatically be stored under /input/pdf during one of the later steps.


# Generate keyword list
To generate the keyword list from your csv of articles, run the following commands and change "records" to the name of your csv. Optionally, a column with author-given keywords and a .bib file with self-determined keywords may be supplied. If these fields are left empty, the package proceeds without them.
This step calculates tags using 7 algorithms for both the titles and the abstracts, which may take several minutes, especially when analysing the abstracts. It then keeps all keywords that are returned by at least 2 and no more than 4 of the 14 resulting lists, which excludes unique word combinations and common English. The final file is saved as "keyword_list" (or your chosen alternative) in the input folder.

In [None]:
from aparts.src.construct_keylist import generate_keylist
generate_keylist(input_folder = input_folder, records = input_csv, bibfile = bib_file, author_given_keywords=author_given_keywords_column, output_name=keyword_list)

Open the input folder. 15 files should have been created: two per algorithm (one for the abstracts and one for the titles) and one final list, "keyword_list.txt" (or your chosen alternative). Check the generated keywords in keywords.txt and remove any artefacts. Optionally, manually add keywords that were missed when merging the different lists.

In [20]:
#open input folder (os.startfile is only available on Windows)
os.startfile(input_folder)

Optionally, to rerun the construction of the keyword list with a different minimum and maximum overlap, change the respective values and run the following:

In [None]:
from aparts.src.construct_keylist import construct_keylist

construct_keylist(input_folder = input_folder, libtex_csv = input_csv, output_name = "keywords", minimum = 2, maximum = 4)
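
To make the effect of the minimum and maximum overlap concrete, the selection rule can be sketched in a few lines. This is an illustration of the rule as described above, not the package's implementation. `select_by_overlap` is a hypothetical name, and the comparison is assumed to be case-sensitive.

```python
from collections import Counter


def select_by_overlap(keyword_lists, minimum=2, maximum=4):
    """Keep keywords that occur in at least `minimum` and at most `maximum` lists."""
    counts = Counter()
    for keywords in keyword_lists:
        counts.update(set(keywords))  # count each keyword once per list
    return sorted(word for word, n in counts.items() if minimum <= n <= maximum)


# Five toy lists standing in for the 14 lists produced by the algorithms
lists = [
    ["mosquito", "larvae", "the"],
    ["mosquito", "the"],
    ["the", "habitat"],
    ["the", "larvae"],
    ["the"],
]
print(select_by_overlap(lists))  # ['larvae', 'mosquito']
```

Here "habitat" is dropped because only one list returns it, and "the" is dropped because it appears in more than four lists.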

## Tag the articles
This step converts the articles from .pdf to UTF-8 encoded .txt, tags each file and stores the output as a csv, using the filename (most likely equal to the article title) as key. Optionally, the tags are added to a .bib file for use in a reference manager like Zotero, and to markdown summaries for use in an editor like Obsidian.
Weighted tagging gives more importance to tags mentioned in certain sections, i.e. the abstract and discussion, than to tags mentioned in the introduction. Unweighted tagging weighs all sections equally.
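
The difference between weighted and unweighted tagging can be illustrated with a small sketch. The section weights below are assumed values chosen for the example; the weights and matching rules that aparts actually uses may differ, and `weighted_tag_scores` is a hypothetical helper.

```python
import re

# Assumed weights for illustration only
SECTION_WEIGHTS = {"abstract": 3, "discussion": 2, "introduction": 1}


def weighted_tag_scores(sections, tags, weights=SECTION_WEIGHTS):
    """Sum the case-insensitive whole-word counts of each tag, multiplied by the section weight."""
    scores = {tag: 0 for tag in tags}
    for name, text in sections.items():
        weight = weights.get(name, 1)  # sections without a weight count once
        for tag in tags:
            pattern = r"\b" + re.escape(tag) + r"\b"
            scores[tag] += weight * len(re.findall(pattern, text, flags=re.IGNORECASE))
    return scores


sections = {
    "abstract": "Culex larvae and Culex adults",
    "introduction": "Larvae of Culex",
}
print(weighted_tag_scores(sections, ["culex", "larvae"]))              # {'culex': 7, 'larvae': 4}
print(weighted_tag_scores(sections, ["culex", "larvae"], weights={}))  # {'culex': 3, 'larvae': 2}
```

Passing an empty weight table reproduces unweighted tagging, in which every mention counts once regardless of the section.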

>If conversion is interrupted for any reason, run pdf2txtfolder again and it will pick up where it stopped previously.

To run everything at once:

In [None]:
from aparts.src.APT import automated_pdf_tagging

automated_pdf_tagging(keylist_path=keyword_list_path, source_folder=pdf_folder, bibfile=bib_file, alternate_lists="all", weighted = True, treshold = 5, summaries = True, separator = separator)

Alternatively, you may run the commands sequentially, for instance when filenames were not equal to the title. To run the pdf to txt conversion and tagging, use:

In [None]:
from aparts.src.APT import pdf2txtfolder, tag_folder_weighted

pdf2txtfolder(PDFfolder = f"{input_folder}/pdf", TXTfolder = f"{input_folder}/pdf/docs")
tag_folder_weighted(input_path = f"{input_folder}/pdf", outputCSV=tagged_files_csv_path, alternate_lists = "all", separator = separator)

To generate the .bib and .md files, use the following commands:
> - If needed (manually) filter the files that have no metadata from the resulting csv. 
> - Make sure CSVtotal has a "title" and "author" column when generating the summaries. Change the column names if needed.
> - If the templates are missing, you may download them [here](https://github.com/SamBoerlijst/aparts/tree/main/app/aparts/src/templates). Please store them under /input/templates.

In [113]:
from aparts.src.APT import write_bib, create_summaries

write_bib(output_csv_file = tagged_files_csv_path, libtex_csv = input_csv_path, bibfile = bib_file_path, bibfolder = f"{output_folder}/bib", CSVtotal = output_csv_path, separator = separator)
#create_summaries(mdFolder = f"{output_folder}/md/", Article_template = f"{template_folder}/Paper.md", Author_template = f"{template_folder}/Author.md", Journal_template = f"{template_folder}/Journal.md", CSVtotal = output_csv_path, separator = separator, author_column=author_column, title_column=title_column)


#### Optional: To merge synonyms from the resulting dataset, run the following

In [9]:
from aparts.src.deduplication import merge_similar_tags_from_dataframe, plot_pca_tags

Dataframe = merge_similar_tags_from_dataframe(input_file=output_csv_path, output=output_csv_path_deduplicated, variables="keywords", id="title", tag_length=4)
plot_pca_tags(data=Dataframe, n_components_for_variance=50, show_plots="all")

Main contributing tags for 50% explained variance: Aedes, Aedes/geminus, Aedes_(O.)_atactavittatustaxonomic, Anopheles, Anopheles/cinereus/hispaniola, Anopheles/flavirostris, Anopheles/punctimacula, Anopheles_atroparvus, Anopheles_cinereus, Anopheles_claviger, Anopheles_darlingi, Anopheles_fluviatilis, Anopheles_lesteri, Anopheles_melanoon, Anopheles_merus, Anopheles_multicolor, Anopheles_punctimacula, Anopheles_stephensi, Anti-predator_behavior, Antibiotics, Arbovirus, Argentina, BG_sentinel_trap, Colloids, Coquilettidia, Culex/modestus, Deinocerites, Diving_angle, Fungi, Horse, Ochlerotatus_pionips, Puerto_Rico, Thailand, Tripteroides, West Nile virus, aedes aegypti, buffer


# Visualization
Generate an interactive note-graph for the papers, tags, authors and journals.

In [11]:
from aparts.src.graph import graph_view

palette = ["#87DE3C", "#7029A6", "#FF1288", "#FFEB0A", "#2B95FF", "#894fc0", "#FFFFFF"]

graph_view(output_csv_path, f"{input_folder}/pdf/docs/", "1080px", "100%", 3, palette, 'network', output_folder, title_column, "Keywords", "Year", "Authors", "Journal")

TypeError: graph_view() takes 7 positional arguments but 13 were given

# Query expansion
Optimize your query by proposing similar tags using a curated set of articles. Alternatively, use the first n articles as training set.

In [None]:
from aparts.src.query_expansion import iteratively_propose_query

query_dict = iteratively_propose_query(original_query=query, input_file=output_csv_path, trainingset=20, n_tags=30, threshold=0.2, tag_column="keywords", target_file=curated_csv_path, target_title_column=title_column, source_title_column=title_column, source_abstract_column=abstract_column, max_matches=30, method="random", subsample_size=10)
for i in query_dict: print(f"{i};{query_dict[i]['query']};{query_dict[i]['score']};{query_dict[i]['matches']}")

ValueError: np.nan is an invalid document, expected byte or unicode string.
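
This ValueError is the message scikit-learn's text vectorizers raise when a document is NaN instead of a string. Since pandas reads empty csv cells as NaN by default, a likely cause is a record without a title, abstract or keywords. The sketch below locates such cells using the standard library. `rows_with_missing_text` is a hypothetical helper and not part of aparts.

```python
import csv


def rows_with_missing_text(csv_path, columns, delimiter=","):
    """Return (row number, column) pairs for empty or absent cells; row 1 is the first data row."""
    missing = []
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        for row_number, row in enumerate(reader, start=1):
            for column in columns:
                if not (row.get(column) or "").strip():
                    missing.append((row_number, column))
    return missing


# Example with the variables defined above (uncomment to use):
# print(rows_with_missing_text(output_csv_path, [title_column, abstract_column, "keywords"], delimiter=separator))
```

Removing the reported rows from the csv, for instance with `df.dropna(subset=[abstract_column])` in pandas, before rerunning the query expansion may resolve the error.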

#### Emulate the retrieved query
The following code tests the number of titles in the curated list that are captured by the new query.

In [None]:
from aparts.src.query_expansion import emulate_query
import pandas as pd

df = pd.read_csv(output_csv_path)
df1 = pd.read_csv(curated_csv_path)

titles = emulate_query(query=query, df=df, title_column=title_column, abstract_column=abstract_column)
selection = df1[df1[title_column].isin(titles)]

print(selection)
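
Printing the selection shows which curated articles were captured; a single number is often easier to compare between queries. The sketch below computes the captured fraction. `capture_rate` is a hypothetical helper, and lowercasing and collapsing whitespace before comparing titles is an assumption that is more lenient than the exact match used above.

```python
def capture_rate(curated_titles, retrieved_titles):
    """Return the fraction of curated titles that occur among the retrieved titles."""

    def normalize(title):
        # compare titles case-insensitively and ignore differences in whitespace
        return " ".join(str(title).lower().split())

    curated = {normalize(title) for title in curated_titles}
    retrieved = {normalize(title) for title in retrieved_titles}
    if not curated:
        return 0.0
    return len(curated & retrieved) / len(curated)


# Example with the variables defined above (uncomment to use):
# print(capture_rate(df1[title_column], titles))
```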

# Articles by dissimilarity

In [14]:
from aparts.src.subsampling import subsample_from_csv
from aparts.src.query_expansion import count_title_matches_from_list

titles = subsample_from_csv(CSV_path=output_csv_path, y="keywords", x="title", n=40, distance_type="dissimilarity")
#count_title_matches_from_list(file=curated_csv_path, file1_column=title_column, selected_list=titles,show_score=True)

for article in titles: print(article)

Pictorial keys for the identification of mosquitoes (Diptera: Culicidae) associated with dengue virus transmission
Comparison of transmission parameters between Anopheles argyritarsis and Anopheles pseudopunctipennis in two ecologically different localities of Bolivia
Estimating population size by genotyping faeces
The dominant Anopheles vectors of human malaria in Africa, Europe and the Middle East: occurrence data, distribution maps and bionomic précis.
Thiacloprid-induced toxicity influenced by nutrients: Evidence from in situ bioassays in experimental ditches: Reduced toxicity under nutrient-enriched conditions
Cutadapt removes adapter sequences from high-throughput sequencing reads
Culiseta annulata : A New Mosquito For Kuwait
Anopheles (Anopheles) pseudopunctipennis Theobald (Diptera: Culicidae): neotype designation and description.
Environmental and socioeconomic effects of mosquito control in Europe using the biocide Bacillus thuringiensis subsp. israelensis (Bti)
Mosquito stud
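
As an illustration of what selecting articles by dissimilarity can mean, the sketch below greedily picks titles whose tag sets overlap least with the titles already picked, using the Jaccard distance. This is a generic technique under stated assumptions and not necessarily how subsample_from_csv works. Both function names are hypothetical.

```python
def jaccard_distance(a, b):
    """Return 1 minus the share of tags that two tag collections have in common."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def pick_dissimilar(tags_by_title, n):
    """Greedily pick up to n titles, each maximizing its minimum distance to the picked ones."""
    titles = list(tags_by_title)
    if not titles or n <= 0:
        return []
    picked = [titles[0]]  # seed with the first title
    while len(picked) < min(n, len(titles)):
        remaining = [title for title in titles if title not in picked]
        best = max(
            remaining,
            key=lambda title: min(
                jaccard_distance(tags_by_title[title], tags_by_title[p]) for p in picked
            ),
        )
        picked.append(best)
    return picked


tags = {
    "A": {"culex", "larvae"},
    "B": {"culex", "larvae", "habitat"},
    "C": {"malaria", "anopheles"},
}
print(pick_dissimilar(tags, 2))  # ['A', 'C']
```

Article C shares no tags with A and is therefore picked before B, which differs from A by a single tag.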

## Identify tag clusters

In [None]:
from aparts.src.deduplication import retrieve_clusters
from aparts.src.query_expansion import analyze_clusters

title_column = "title"
dataframe = retrieve_clusters(input_file=output_csv_path, output=output_csv_path_clusters, variables="keywords", id=title_column, tag_length=4,  number_of_records="", n_components_for_variance=0, show_plots="", transpose=False, max_clusters=5, save_clusters = "all", visualize_clusters=True)
query_dict = analyze_clusters(query=query, cluster_column="Cluster", cluster_range=(1, 5),training_filepath=output_csv_path_clusters, test_filepath=curated_csv_path, title_column_training_file=title_column, title_column_test_file=title_column, abstract_column=abstract_column, keyword_column="keywords", n_tags=30, threshold=0.2)

##### Find dominant tags

In [12]:
from aparts.src.deduplication import retrieve_pca_components 
components = retrieve_pca_components(input_file=output_csv_path, output=output_csv_path, variables="keywords", id="title", tag_length=4, n_components_for_variance=80, number_of_records=50, show_plots="loading and scree and saturation")

IndexError: list index out of range



## Upcoming: text summarization
Summarize all items in a csv by the top n sentences based on the tags they contain.

In [5]:
from aparts.src.summarization import summarize_csv
summarize_csv(outputCSV=f"{output_folder}/csv/summarized.csv", txtfolder=f"{input_folder}/pdf/docs", sections=['abstract', 'discussion', 'conclusion'], amount=3, offset=2)