Pierre Chaumeil edited this page Sep 11, 2023 · 21 revisions

Scrape LPSN

cd /srv/db/gtdb/metadata/release213
mkdir lpsn
cd lpsn
mkdir lpsn_<date> (e.g. lpsn_20220718)
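
The dated directory name can be generated rather than typed by hand. The following is a minimal sketch, assuming the date is always written as YYYYMMDD as in the example above; the helper name `lpsn_dir_name` is hypothetical and is not part of gtdb_migration_tk.

```shell
# Hypothetical helper: build the lpsn_<date> directory name.
# Assumption: <date> is formatted as YYYYMMDD (as in lpsn_20220718).
lpsn_dir_name() {
    # Use the date passed as the first argument, or today's date if none is given.
    stamp="${1:-$(date +%Y%m%d)}"
    echo "lpsn_${stamp}"
}

# Example usage (run from inside the lpsn directory):
#   mkdir "$(lpsn_dir_name)"
```

Passing an explicit date, such as `lpsn_dir_name 20220718`, reproduces a name from an earlier scrape.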

gtdb_migration_tk lpsn pull_html -o /srv/db/gtdb/metadata/release<version>/lpsn/
gtdb_migration_tk lpsn parse_html --in_dir . -o parse_html/ --lpsn_gss_file lpsn_gss_2022-09-19.csv

Downloads of LPSN pages can fail, so it may be necessary to run lpsn pull_html several times. Failed downloads are listed in the *_failed.lst files. The --skip_taxa_per_letter_dl flag can be used to speed up additional runs of pull_html. Do not proceed to lpsn parse_html until there are no failed downloads.
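
Checking for remaining failures can be scripted. The sketch below assumes that each failed download is recorded as one non-empty line in a *_failed.lst file somewhere under the output directory; that file layout is an assumption, and the function name `count_failed` is hypothetical.

```shell
# Hypothetical helper: count entries across all *_failed.lst files under a directory.
# Assumption: one failed download per non-empty line.
count_failed() {
    # grep -c prints the number of non-empty lines in each file;
    # awk sums those counts and prints 0 when no files were found.
    find "${1:-.}" -type f -name '*_failed.lst' -exec grep -c . {} \; |
        awk '{ total += $1 } END { print total + 0 }'
}

# Example usage: rerun lpsn pull_html until this prints 0.
#   count_failed .
```

A result of 0 means either that every *_failed.lst file is empty or that none exist, so it is worth confirming that pull_html has actually run before relying on it.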

Create the date table

gtdb_migration_tk strains date_table --lpsn_scraped_species_info /srv/db/gtdb/metadata/release<version>/lpsn/parse_html/all_ranks/lpsn_species.tsv --lpsn_gss_file ../lpsn/lpsn_gss_2022-09-19.csv --output_file year_table.txt

The file passed to --lpsn_gss_file is obtained from the Download section of the LPSN website.

Create a summary table for each source

python -m gtdb_migration_tk strains type_table --lpsn_dir ../lpsn/parse_html/all_ranks/ --year_table year_table.txt --metadata /srv/db/gtdb/metadata/release207/metadata_for_import/metadata207.tsv --ncbi_names /srv/db/gtdb/metadata/release207/ncbi/taxonomy/taxdump/20210719/names.dmp --ncbi_nodes /srv/db/gtdb/metadata/release207/ncbi/taxonomy/taxdump/20210719/nodes.dmp --cpus 2 --output_dir . --lpsn_gss_file ../lpsn/lpsn_gss_2021-08-23.csv