# __Step 5.1: Species over time__

Goals here:
- Determine overall genus mention
- Determine genus mention over time
- Same analysis at other taxonomic levels

Considerations:
- Deal with common names
  - Common names must be those in the USDA common name database.
  - Non-specific names will not be counted, even when they frequently refer to a particular species.
- Deal with synonyms
  - Both the NCBI and USDA data include synonym information. Synonyms will be mapped to a specific taxonomic level.
- Deal with redundancy
  - Multiple taxonomic levels may be mentioned in a single title/abstract, e.g., Solanaceae, Solanum, tomato. Such a record will be counted just once at the family level and once at the genus level.
- Missing info
  - Some species info may be mentioned only in the full text.
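
The redundancy rule above can be illustrated with a minimal sketch. The lookup table `name_to_lineage`, the function `count_level`, and the toy records are all hypothetical and are not part of this notebook's pipeline. The sketch only shows the idea of collapsing every matched name in a record to one taxon per level before counting.

```python
from collections import Counter

# Hypothetical lookup: each matched name (scientific, common, or synonym)
# points to the taxon it belongs to at each level it can be resolved to.
name_to_lineage = {
    "solanaceae":  {"family": "Solanaceae"},
    "solanum":     {"family": "Solanaceae", "genus": "Solanum"},
    "tomato":      {"family": "Solanaceae", "genus": "Solanum"},
    "arabidopsis": {"family": "Brassicaceae", "genus": "Arabidopsis"},
}

def count_level(records, level):
    """Count the number of records mentioning each taxon at `level`.

    A taxon is counted at most once per record, no matter how many
    names in that record resolve to it.
    """
    counts = Counter()
    for names in records:
        # A set removes within-record redundancy.
        taxa = set()
        for name in names:
            lineage = name_to_lineage.get(name.lower(), {})
            if level in lineage:
                taxa.add(lineage[level])
        counts.update(taxa)
    return counts

# Toy records: each is the list of names found in one title/abstract.
records = [
    ["Solanaceae", "Solanum", "tomato"],
    ["tomato", "Arabidopsis"],
    ["unknown plant"],
]
genus_counts  = count_level(records, "genus")
family_counts = count_level(records, "family")
```

In the first toy record, three names resolve to a single family and a single genus, so the record contributes one count at each level.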

## ___Set up___

### Module import

In [4]:
import pickle
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from pathlib import Path
from tqdm import tqdm

### Key variables

In [5]:
# Reproducibility
seed = 20220609

# Setting working directory
proj_dir   = Path.home() / "projects/plant_sci_hist"
work_dir   = proj_dir / "5_species_over_time/"
work_dir.mkdir(parents=True, exist_ok=True)

# species information
# NEED TO SPECIFY: the file name for usda_plant_db below is still a placeholder
dir1           = proj_dir / "1_obtaining_corpus"
names_dmp_path = dir1 / "names.dmp"
nodes_dmp_path = dir1 / "nodes.dmp"
usda_plant_db  = dir1 / ### SPECIFY

# plant science corpus with date and other info
dir2        = proj_dir / "2_text_classify"
corpus_file = dir2 / "corpus_plant_421740.gz"

# Embed fonts as TrueType (Type 42) so text in saved PDFs remains editable
mpl.rcParams['pdf.fonttype'] = 42
plt.rcParams["font.family"] = "sans-serif"

## ___Get plant names___

In `1_obtaining_corpus`, plant names are from two sources:
- NCBI: the taxonomy database, which covers all taxonomic levels
  - This will also contain synonyms for different levels.
- USDA: plant common names with species information
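
As a rough sketch of how the NCBI names could be read, the snippet below parses lines in the `names.dmp` layout, where fields are separated by `\t|\t` and each line ends with `\t|`. The field order (tax ID, name, unique name, name class) follows the NCBI taxdump readme. The function `parse_names_dmp` and the sample lines are illustrative assumptions, not the code used in this notebook.

```python
def parse_names_dmp(lines):
    """Map each lower-cased name to a (tax_id, name_class) tuple.

    `lines` is any iterable of strings in the names.dmp layout,
    e.g. an open file handle.
    """
    name_dict = {}
    for line in lines:
        line = line.rstrip("\n")
        # Drop the line terminator "\t|" before splitting into fields.
        if line.endswith("\t|"):
            line = line[:-2]
        fields = line.split("\t|\t")
        if len(fields) < 4:
            continue  # skip malformed lines
        tax_id, name_txt, _unique_name, name_class = fields[:4]
        name_dict[name_txt.lower()] = (tax_id, name_class)
    return name_dict

# Illustrative lines only; the real file has millions of entries.
sample_lines = [
    "3702\t|\tArabidopsis thaliana\t|\t\t|\tscientific name\t|\n",
    "3702\t|\tthale cress\t|\t\t|\tcommon name\t|\n",
    "4081\t|\tSolanum lycopersicum\t|\t\t|\tscientific name\t|\n",
    "4081\t|\tLycopersicon esculentum\t|\t\t|\tsynonym\t|\n",
]
name_dict = parse_names_dmp(sample_lines)
```

Keying on the lower-cased name makes matching against titles and abstracts case-insensitive. Because a later entry overwrites an earlier one, a name shared by several taxa would need extra handling in a real run.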


### NCBI taxonomy