# CA Prop 65 Chemical categorization
California Proposition 65 requires labeling for chemicals recognized by the state of California as causing cancer or reproductive/developmental toxicity. This enables us to categorize the chemicals on the Prop 65 list as instances of 'carcinogen', 'reproductive toxicant', 'male reproductive toxicant', 'female reproductive toxicant', and 'developmental toxicant' in Wikidata, adding some very basic (and somewhat indirect) chemical-disease relationships.

Note that California's OEHHA allows users to download a pre-exported .csv file of chemicals listed under Prop 65 (does not include de-listed chemicals). Alternatively, users can export the complete list of chemicals from OEHHA which will include chemicals that are under consideration, currently listed, or formerly listed.

This notebook partially explores both exports to find the best way to load the data into Wikidata. The final bot will likely NOT include both methods.

In [1]:
from wikidataintegrator import wdi_core, wdi_login, wdi_helpers
from wikidataintegrator.ref_handlers import update_retrieved_if_new_multiple_refs
import pandas as pd
from pandas import read_csv
import requests
from tqdm.notebook import trange, tqdm
import ipywidgets 
import widgetsnbextension
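import xml.etree.ElementTree as et
import time
import json  ## needed for the logging header below
import os    ## needed for the environment-variable login fallback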


In [None]:
## Proposition 65 ID: P7524

In [2]:
## Note that the property start date is used for list date.
## When placed in the references, Deltabot moved it out as a qualifier

from datetime import datetime
import copy
def create_reference(prop65_url):
    refStatedIn = wdi_core.WDItemID(value="Q28455381", prop_nr="P248", is_reference=True)
    timeStringNow = datetime.now().strftime("+%Y-%m-%dT00:00:00Z")
    refRetrieved = wdi_core.WDTime(timeStringNow, prop_nr="P813", is_reference=True)
    refURL = wdi_core.WDUrl(value=prop65_url, prop_nr="P854", is_reference=True)
    return [refStatedIn, refRetrieved, refURL]
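Since the list date ends up as a start-date qualifier, the OEHHA date strings (e.g. '4/8/2016') need to be converted into the time string format that Wikidata expects. A minimal sketch, assuming day precision and the month/day/year format seen in the OEHHA export; the helper name `to_wikidata_time` is hypothetical and not part of wikidataintegrator:

```python
from datetime import datetime

def to_wikidata_time(date_text, fmt="%m/%d/%Y"):
    """Convert an OEHHA date such as '4/8/2016' to a Wikidata time string."""
    return datetime.strptime(date_text, fmt).strftime("+%Y-%m-%dT00:00:00Z")

print(to_wikidata_time("4/8/2016"))  # +2016-04-08T00:00:00Z
```

The resulting string is in the same form that `create_reference` builds for the retrieved date.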

In [None]:
"""
## Login for Scheduled bot
print("Logging in...")
try:
    from scheduled_bots.local import WDUSER, WDPASS
except ImportError:
    if "WDUSER" in os.environ and "WDPASS" in os.environ:
        WDUSER = os.environ['WDUSER']
        WDPASS = os.environ['WDPASS']
    else:
        raise ValueError("WDUSER and WDPASS must be specified in local.py or as environment variables")
"""

In [3]:
print("Logging in...")
import wdi_user_config ## Credentials stored in a wdi_user_config file
login_dict = wdi_user_config.get_credentials()
login = wdi_login.WDLogin(login_dict['WDUSER'], login_dict['WDPASS'])


Logging in...
https://www.wikidata.org/w/api.php
Successfully logged in as Gtsulab


In [None]:
## Set up logging

wdi_core.WDItemEngine.setup_logging(header=json.dumps({'name': 'prop_65', 
                                                       'timestamp': str(datetime.now()), 
                                                       'run_id': str(datetime.now())}))

# Get the most recent csv file and parse it

## Prop 65 list method
The updates appear to be posted on this site: https://oehha.ca.gov/proposition-65/proposition-65-list
Since the updated data file name changes depending on the update date, it would be nice if the newer file name could be scraped from this page instead of being hard-coded in. It would be nice if date could also be scraped from this page and used for comparison to determine if changes were made and if an update is needed.

Unfortunately, this page uses iframes and is protected from robots by captchas (Incapsula), so the csv should be manually downloaded to the data folder and named by the update date (YYYY-MM-DD.csv) before running the bot.
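Because the manually downloaded files are named by update date, the bot could pick the newest one from the data folder rather than hard-coding the file name, and compare its date to the date of the last run. A rough sketch of that idea; the function name `latest_dated_csv` and the `last_run_date` variable are hypothetical:

```python
from datetime import datetime
from pathlib import Path

def latest_dated_csv(folder):
    """Return (path, date) for the newest YYYY-MM-DD.csv in folder, or None."""
    dated = []
    for path in Path(folder).glob("*.csv"):
        try:
            file_date = datetime.strptime(path.stem, "%Y-%m-%d").date()
        except ValueError:
            continue  # skip files that are not named by date
        dated.append((file_date, path))
    if not dated:
        return None
    file_date, path = max(dated, key=lambda pair: pair[0])
    return path, file_date

## Possible usage (last_run_date is assumed to be stored somewhere by the bot):
## found = latest_dated_csv('data')
## if found is not None and found[1] > last_run_date:
##     datasrc = str(found[0])
```

Files whose names do not parse as a date (such as the OEHHA export used later) are simply ignored.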

In [None]:
datasrc = 'data/2019-09-13.csv'

header_junk = 11 ## Note, the number of rows to skip may change
tail_junk = 6

chem_list = read_csv(datasrc, skiprows=header_junk, encoding = 'unicode_escape') 
chem_list.dropna(axis='columns', how='all',inplace=True)
chem_list.fillna("None", inplace=True)
chem_list.drop_duplicates(keep='first',inplace=True)
## Filter out blank entries
chem_list_clean = chem_list.loc[(chem_list['Chemical']!="None")].copy()
chem_list_clean.drop(chem_list_clean.tail(tail_junk).index,inplace=True)
print(chem_list_clean.tail(n=3))

Note that the csv table has sub-entries which appear as separate entries but should retain the header entry's data except for the last field. These will need to be accounted for. As of the csv file dated 2019-09-13, these entries are preceded by some spaces. Another way to filter for them is to pull entries which have "None" for everything but 'chemical name' and 'NSRL or MADL'.

In [None]:
sub_entries = chem_list_clean.loc[(chem_list_clean['Chemical'].str.contains("  ")) &
                                  (chem_list_clean['Type of Toxicity']=="None") &
                                  (chem_list_clean['Listing Mechanism']=="None")]
print(sub_entries.head(n=3))
print(len(sub_entries))

The toxicity information for these sub-entries can be found in the header entry. To find it, obtain the index values for these entries. If the sub-entries are sequential, the header entry should be 1 less than the smallest index value. The values could be copied over, though the CAS number should not be.
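One way to copy the header values down without doing index arithmetic is to blank out the inherited columns on the sub-entry rows and forward-fill them. The sketch below uses a made-up toy table (the chemical names and values are placeholders, not real Prop 65 data), and the 'CAS No.' column name is an assumption about the csv:

```python
import pandas as pd

## Toy stand-in for chem_list_clean; all values are illustrative only
toy = pd.DataFrame({
    "Chemical": ["Header chemical", "  sub-entry one", "  sub-entry two", "Other chemical"],
    "Type of Toxicity": ["cancer", "None", "None", "developmental"],
    "Listing Mechanism": ["AB", "None", "None", "FR"],
    "CAS No.": ["---", "None", "None", "00-00-0"],
    "NSRL or MADL": ["None", "1.0", "2.0", "None"],
})

## Same sub-entry filter as used in the notebook
is_sub = (toy["Chemical"].str.contains("  ") &
          (toy["Type of Toxicity"] == "None") &
          (toy["Listing Mechanism"] == "None"))

filled = toy.copy()
for col in ["Type of Toxicity", "Listing Mechanism"]:
    ## Hide the placeholder on sub-entry rows, then pull the header value down
    filled[col] = filled[col].mask(is_sub).ffill()

## The CAS number is deliberately left untouched on the sub-entries
print(filled)
```

This relies on sub-entries directly following their header entry, which matches the layout described above.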

Alternatively, entries without CAS numbers can be assumed to have sub-entries and can be ignored, but doing this will skip entries which have profiles but no CAS numbers. E.g., Alcoholic Beverages doesn't have a CAS No. but will likely be appropriately matched via Mix N Match. As a result, we should be able to cover cases like this even without a CAS number.

That said, aggregate entries only have 1 prop 65 page and multiple Wikidata pages, which would result in one to many mappings. Since it's unclear how these will be handled by the Mix N match community, the first pass should ignore the aggregate entries (which can be identified as mentioned above)

### Chemical names to URL conversion
The property in Wikidata uses the URL stub as ID so we'll need to convert the Chemical names to url stubs that work with prop65 website. The urls will then be mapped to Wikidata entries with the property that were added via Mix N match. Normally, urls can be tested, but CA Prop 65 website has captcha protection and blocks scrapers.

Also, note that the items that have been de-listed have urls, but their urls are no longer readily available (ie- the urls work, but are not necessarily listed)

Example conversion:
* "A-alpha-C (2-Amino-9H-pyrido[2,3-b]indole)" --> "alpha-c-2-amino-9h-pyrido23-bindole"
* "Altretamine" --> "altretamine"
* "Allyl chloride  Delisted October 29, 1999 [Click here for the basis for delisting]" --> "allyl-chloride"
* "p-Aminoazobenzene" --> "p-aminoazobenzene"
* "4-Aminobiphenyl (4-aminodiphenyl)" --> "4-aminobiphenyl-4-aminodiphenyl"
* "2-Amino-5-(5-nitro-2-furyl)-1,3,4-thiadiazole" --> "2-amino-5-5-nitro-2-furyl-134-thiadiazole"
* "?-Methyl styrene (alpha-Methylstyrene)" --> "methyl-styrene-alpha-methylstyrene"
* "N,N'-Diacetylbenzidine" --> "nn-diacetylbenzidine"
* "MeIQx (2-Amino-3,8-dimethylimidazo[4,5-f]quinoxaline)" --> "meiqx-2-amino-38-dimethylimidazo45-fquinoxaline"
* "2?Mercaptobenzothiazole" --> "2-mercaptobenzothiazole"
* "Trp-P-1 (Tryptophan-P-1)" --> "trp-p-1-tryptophan-p-1"
* "Alcoholic beverages, when associated with alcohol abuse" --> "alcoholic-beverages"
* "Aspirin (NOTE:  It is especially  important not to use aspirin during the last three months of pregnancy,  unless specifically directed to do so by a physician because it may cause  problems in the unborn child or  complications during delivery.)" --> "aspirin"

Clean up miscellaneous notes included in the chemical names, such as delistings, delisting dates, and changes in listings. These should be added as deprecation of the statements.
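The regular cases in the example conversions above can be approximated with a small function. This is only a sketch: the function name `name_to_stub` is hypothetical, the rules are inferred from the examples rather than from any OEHHA documentation, and irregular cases (e.g. "A-alpha-C ..." or "Alcoholic beverages, ...") are not handled.

```python
import re

def name_to_stub(name):
    """Approximate the OEHHA url stub for a Prop 65 chemical name."""
    ## Drop trailing delisting notes and "(NOTE: ...)" remarks
    name = re.split(r"\s+Delisted\b|\s*\(NOTE:", name, maxsplit=1)[0]
    name = name.lower()
    ## Remove everything except letters, digits, whitespace and dashes
    name = re.sub(r"[^a-z0-9\s-]", "", name)
    ## Collapse runs of whitespace/dashes into a single dash
    return re.sub(r"[\s-]+", "-", name).strip("-")

print(name_to_stub("N,N'-Diacetylbenzidine"))
print(name_to_stub("Allyl chloride  Delisted October 29, 1999 [Click here for the basis for delisting]"))
```

Since the stubs cannot be tested against the captcha-protected site, any output would still need a manual spot check.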

Because of all these issues, we'll work with the OEHHA list rather than the Prop 65 list

## CA OEHHA clean up method
The manually triggered export of the chemical list from the OEHHA site has less header junk, fewer random title notes, and fewer other things that disrupt the structure, which makes the name conversion easier. Additionally, the data on cancer, reproductive toxicity, etc. is more structured and doesn't have random blank spaces.

In [4]:
datasrc = 'data/OEHHA-2019-11-1.csv'

chem_list = read_csv(datasrc, encoding = 'unicode_escape', header=0) 
print(chem_list.columns)
chem_list.dropna(axis='columns', how='all',inplace=True)
chem_list.fillna("None", inplace=True)
#print(chem_list.columns.values)

## Pull out only columns of interest for our task
cols_of_interest = chem_list[['Title','CAS Number','Cancer','Cancer - Listing Mechanism',
                          'Reproductive Toxicity','Chemical listed under Proposition 65 as causing',
                          'Developmental Toxicity - Date of Listing','Developmental Toxicity - Listing Mechanism',
                          'Female Reproductive Toxicity - Date of Listing',
                          'Female Reproductive Toxicity - Listing Mechanism',
                          'Male Reproductive Toxicity - Date of Listing',
                          'Male Reproductive Toxicity - Listing Mechanism']]

## Remove entries which are not relevant
prop_65_irrelevant = cols_of_interest.loc[(cols_of_interest['Cancer'] == "None") & 
                                          (cols_of_interest['Reproductive Toxicity'] == "None") & 
                                          (cols_of_interest['Chemical listed under Proposition 65 as causing'] == "None")]
non_prop_chems = prop_65_irrelevant['Title'].tolist()
prop65_chems = cols_of_interest.loc[~cols_of_interest['Title'].isin(non_prop_chems)].copy()
#print(prop65_chems.head(n=2))

Index(['Title', 'CAS Number', 'Use(s)', 'Synonym(s)', 'Latest Criteria',
       'Inhalation Unit Risk (Î¼g/cubic meter)-1',
       'Inhalation Slope Factor (mg/kg-day)-1',
       'Oral Slope Factor (mg/kg-day)-1', 'Last Cancer Potency Revision',
       'Acute REL (Î¼g/m3)', 'Species', 'Acute REL Toxicologic Endpoint',
       'Acute REL Target Organs', 'Severity', 'Last Acute REL Revision',
       '8-Hour Inhalation REL (Î¼g/m3)', 'Last 8-Hour REL Revision',
       'Chronic Inhalation REL (Î¼g/m3)', 'Chronic Toxicologic Endpoint',
       'Chronic Target Organs', 'Human Data', 'Health Risk Category',
       'Cancer Risk at PHG', 'MCL value (mg/L)', 'Cancer Risk at MCL',
       'Notification Level (Î¼g/L)', 'Public Health Goal (mg/L)',
       'Last PHG Revision', 'No Significant Risk Level (NSRL) - Inhalation',
       'No Significant Risk Level (NSRL) - Oral',
       'Maximum Allowable Dose Level (MADL) for chemicals causing reproductive toxicity - Inhalation',
       'Maximum Allowable D

### Chemical names to URL conversion
The property in Wikidata uses the URL stub as ID so we'll need to convert the Chemical names to url stubs that work with prop65 website. The urls will then be mapped to Wikidata entries with the property that were added via Mix N match. Normally, urls can be tested, but CA Prop 65 website has captcha protection and blocks scrapers.

Example conversion:
"OEHHA listing" --> "OEHHA url" | "Prop 65 listing" --> "Prop 65 url"
* "Amino-alpha-carboline" --> "amino-alpha-carboline" | "A-alpha-C (2-Amino-9H-pyrido[2,3-b]indole)" --> "alpha-c-2-amino-9h-pyrido23-bindole"
* "Allyl Chloride" --> "allyl-chloride" | not listed
* "alpha-Methylstyrene" --> "alpha-methylstyrene" | "?-Methyl styrene (alpha-Methylstyrene)" --> "methyl-styrene-alpha-methylstyrene"
* "MeIQx (2-Amino-3,8-dimethylimidazo[4,5-f]quinoxaline)" --> "meiqx-2-amino-38-dimethylimidazo45-fquinoxaline" | "MeIQx (2-Amino-3,8-dimethylimidazo[4,5-f]quinoxaline)" --> "meiqx-2-amino-38-dimethylimidazo45-fquinoxaline"
* "2-Mercaptobenzothiazole" --> "2-mercaptobenzothiazole" | "2?Mercaptobenzothiazole" --> "2-mercaptobenzothiazole"
* "Trp-P-1 (Tryptophan-P-1)" --> "trp-p-1-tryptophan-p-1" | "Trp-P-1 (Tryptophan-P-1)" --> "trp-p-1-tryptophan-p-1"
* "Alcoholic beverages, when associated with alcohol abuse" --> "alcoholic-beverages" | "Alcoholic beverages, when associated with alcohol abuse" --> "alcoholic-beverages"
* "Aspirin" --> "aspirin" | "Aspirin (NOTE:  It is especially  important not to use aspirin during the last three months of pregnancy,  unless specifically directed to do so by a physician because it may cause  problems in the unborn child or  complications during delivery.)" --> "aspirin"

In [5]:
## To convert the title to a url stub, lower case it, strip out parentheses, brackets, and commas, and replace spaces with dashes
## regex=False makes "[", "(" and ")" literal characters regardless of the pandas version
url_stub = prop65_chems['Title'].str.lower()
for char in ["[", "]", ",", "(", ")"]:
    url_stub = url_stub.str.replace(char, "", regex=False)
prop65_chems['url_stub'] = url_stub.str.replace(" ", "-", regex=False)
#print(prop65_chems.head())

## Check the look of the url stub
#print(prop65_chems.loc[prop65_chems['Title']=="Allyl Chloride"])
#print(prop65_chems.loc[prop65_chems['Title']=="Trp-P-1 (Tryptophan-P-1)"])
#print(prop65_chems.loc[prop65_chems['Title']=="MeIQx (2-Amino-3,8-dimethylimidazo[4,5-f]quinoxaline)"])
#print(prop65_chems.head(n=2))

#print(prop65_chems.head(n=2))
mixnmatch_cat = prop65_chems[['url_stub','Title','CAS Number']].copy()
mixnmatch_cat.rename(columns={'url_stub':'Entry ID','Title':'Entry name'}, inplace=True)
mixnmatch_cat['Entry description'] = mixnmatch_cat['Entry name'].astype(str).str.cat(mixnmatch_cat['CAS Number'].astype(str),sep=", CAS Number: ")
#mixnmatch_cat.drop('CAS Number',axis=1,inplace=True)
print(mixnmatch_cat.head(n=2))

#mixnmatch_cat.to_csv('data/mixnmatch_cat.tsv',sep='\t', header=True)

              Entry ID           Entry name   CAS Number  \
0  abiraterone-acetate  Abiraterone acetate  154229-18-2   
2         acetaldehyde         Acetaldehyde      75-07-0   

                              Entry description  
0  Abiraterone acetate, CAS Number: 154229-18-2  
2             Acetaldehyde, CAS Number: 75-07-0  


### Mapping items to reduce Mix N Match need
For entries that can be mapped via CAS ID go ahead and add the Prop65 ID

In [6]:
sparqlQuery = "SELECT * WHERE {?item wdt:P231 ?CAS}"
result = wdi_core.WDItemEngine.execute_sparql_query(sparqlQuery)
cas_in_wd_list = []

i=0
while i < len(result["results"]["bindings"]):
    cas_id = result["results"]["bindings"][i]["CAS"]["value"]
    wdid = result["results"]["bindings"][i]["item"]["value"].replace("http://www.wikidata.org/entity/", "")
    cas_in_wd_list.append({'WDID':wdid,'CAS Number':cas_id})
    i=i+1

cas_in_wd = pd.DataFrame(cas_in_wd_list)
cas_in_wd.drop_duplicates(subset='CAS Number',keep=False,inplace=True)
cas_in_wd.drop_duplicates(subset='WDID',keep=False,inplace=True)
print(cas_in_wd.head(n=2))

   CAS Number   WDID
0  12385-13-6   Q556
1  53850-35-4  Q1232
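The loop that flattens the SPARQL bindings is repeated for each query in this notebook, so it could be pulled into a helper. A possible sketch, run here on a hand-built result dictionary that mimics the structure indexed above (so no network access is needed); the name `bindings_to_df` is hypothetical:

```python
import pandas as pd

WD_ENTITY_PREFIX = "http://www.wikidata.org/entity/"

def bindings_to_df(result, value_var, value_col):
    """Flatten SPARQL bindings for '?item ?<value_var>' into a DataFrame."""
    rows = []
    for binding in result["results"]["bindings"]:
        rows.append({
            "WDID": binding["item"]["value"].replace(WD_ENTITY_PREFIX, ""),
            value_col: binding[value_var]["value"],
        })
    return pd.DataFrame(rows)

## Hand-built sample in the same shape as the query result
sample = {"results": {"bindings": [
    {"item": {"value": WD_ENTITY_PREFIX + "Q556"}, "CAS": {"value": "12385-13-6"}},
    {"item": {"value": WD_ENTITY_PREFIX + "Q1232"}, "CAS": {"value": "53850-35-4"}},
]}}

cas_df = bindings_to_df(sample, "CAS", "CAS Number")
print(cas_df)
```

The same helper would cover the P7524 queries by passing `"CA65"` as the variable name and `"Entry ID"` or `"url_stub"` as the column name.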


In [7]:
prop_65_matches = mixnmatch_cat.merge(cas_in_wd,on='CAS Number',how='inner')
print(prop_65_matches.head(n=2))
print(len(prop_65_matches))
#prop_65_matches.to_csv('data/mixnmatch_cat_with_cas.tsv',sep='\t', header=True)

              Entry ID           Entry name   CAS Number  \
0  abiraterone-acetate  Abiraterone acetate  154229-18-2   
1         acetaldehyde         Acetaldehyde      75-07-0   

                              Entry description       WDID  
0  Abiraterone acetate, CAS Number: 154229-18-2  Q27888393  
1             Acetaldehyde, CAS Number: 75-07-0     Q61457  
798


In [8]:
## Pull things matched via mix n match
sparqlQuery = "SELECT ?item ?CA65 WHERE {?item wdt:P7524 ?CA65}"
result = wdi_core.WDItemEngine.execute_sparql_query(sparqlQuery)
CA65_in_wd_list = []

i=0
while i < len(result["results"]["bindings"]):
    CA65_id = result["results"]["bindings"][i]["CA65"]["value"]
    wdid = result["results"]["bindings"][i]["item"]["value"].replace("http://www.wikidata.org/entity/", "")
    CA65_in_wd_list.append({'WDID':wdid,'Entry ID':CA65_id})
    i=i+1

CA65_in_wd = pd.DataFrame(CA65_in_wd_list)
print(len(CA65_in_wd))

107


In [9]:
## Remove items matched via mix n match from update
#print(CA65_in_wd.head(n=2))
prop_65_less_mixnmatch = prop_65_matches.loc[~prop_65_matches['Entry ID'].isin(CA65_in_wd['Entry ID'].tolist())]
print(prop_65_less_mixnmatch.head(n=2))

     Entry ID Entry name  CAS Number                  Entry description  \
58   auramine   Auramine    492-80-8     Auramine, CAS Number: 492-80-8   
59  auranofin  Auranofin  34031-32-8  Auranofin, CAS Number: 34031-32-8   

         WDID  
58  Q26840770  
59    Q421230  


In [None]:
prop65_to_add = prop_65_less_mixnmatch[0:10]
url_base = 'https://oehha.ca.gov/chemicals/'
list_prop = "P7524" 

for i in tqdm(range(len(prop65_to_add))):
    prop_65_qid = prop65_to_add.iloc[i]['WDID']
    prop_65_id = prop65_to_add.iloc[i]['Entry ID']
    prop_65_url = url_base+prop_65_id
    reference = create_reference(prop_65_url)
    prop65_statement = [wdi_core.WDString(value=prop_65_id, prop_nr=list_prop, 
                               references=[copy.deepcopy(reference)])]
    item = wdi_core.WDItemEngine(wd_item_id=prop_65_qid, data=prop65_statement, append_value=list_prop,
                               global_ref_mode='CUSTOM', ref_handler=update_retrieved_if_new_multiple_refs)
    item.write(login, edit_summary="added CA prop 65 id")
    print(prop_65_id, prop_65_qid, prop_65_url)
    

# Clean up the data for Wikidata

In [10]:
## Run sparql query to pull all entities with Prop 65 ID (Read Only Run)
sparqlQuery = "SELECT ?item ?CA65 WHERE {?item wdt:P7524 ?CA65}"
result = wdi_core.WDItemEngine.execute_sparql_query(sparqlQuery)
CA65_in_wd_list = []

i=0
while i < len(result["results"]["bindings"]):
    CA65_id = result["results"]["bindings"][i]["CA65"]["value"]
    wdid = result["results"]["bindings"][i]["item"]["value"].replace("http://www.wikidata.org/entity/", "")
    CA65_in_wd_list.append({'WDID':wdid,'url_stub':CA65_id})
    i=i+1

## Inspect the results for mapping or coverage issues
CA65_in_wd = pd.DataFrame(CA65_in_wd_list)
print("resulting mapping table has: ",len(CA65_in_wd)," rows.")

resulting mapping table has:  107  rows.


In [11]:
## Perform left merge for currently listed and partially delisted items
prop_65_mapped = prop65_chems.merge(CA65_in_wd, on='url_stub', how='left')
#print(prop_65_mapped.head(n=2))

In [12]:
prop_65_mapped['Dev current'] = prop_65_mapped['Chemical listed under Proposition 65 as causing'].str.contains("Development")
prop_65_mapped['Male current'] = prop_65_mapped['Chemical listed under Proposition 65 as causing'].str.contains("Male")
prop_65_mapped['Female current'] = prop_65_mapped['Chemical listed under Proposition 65 as causing'].str.contains("Female")
prop_65_mapped['Cancer current'] = prop_65_mapped['Cancer'].str.contains("Current")
prop_65_mapped['Rep current'] = prop_65_mapped['Reproductive Toxicity'].str.contains("Current")

In [13]:
prop_65_mapped['Cancer delisted'] = prop_65_mapped['Cancer'].str.contains("Formerly")
prop_65_mapped['Rep delisted'] = prop_65_mapped['Reproductive Toxicity'].str.contains("Formerly")
prop_65_mapped.loc[(((prop_65_mapped['Developmental Toxicity - Date of Listing']!="None")|
        (prop_65_mapped['Developmental Toxicity - Listing Mechanism']!="None"))&
        (prop_65_mapped['Dev current']==False)), 'Dev delisted'] = True
prop_65_mapped.loc[(((prop_65_mapped['Female Reproductive Toxicity - Date of Listing']!="None")|
        (prop_65_mapped['Female Reproductive Toxicity - Listing Mechanism']!="None"))&
        (prop_65_mapped['Female current']==False)), 'Female delisted'] = True
prop_65_mapped.loc[(((prop_65_mapped['Male Reproductive Toxicity - Date of Listing']!="None")|
        (prop_65_mapped['Male Reproductive Toxicity - Listing Mechanism']!="None"))&
        (prop_65_mapped['Male current']==False)), 'Male delisted'] = True

In [14]:
prop_65_mapped.fillna(False, inplace=True)

In [15]:
chem_info = prop_65_mapped.iloc[0]
print(chem_info)

Title                                                                             Abiraterone acetate
CAS Number                                                                                154229-18-2
Cancer                                                                                           None
Cancer - Listing Mechanism                                                                       None
Reproductive Toxicity                                                                Currently listed
Chemical listed under Proposition 65 as causing     Developmental Toxicity, Male Reproductive Toxi...
Developmental Toxicity - Date of Listing                                                     4/8/2016
Developmental Toxicity - Listing Mechanism                                                         FR
Female Reproductive Toxicity - Date of Listing                                               4/8/2016
Female Reproductive Toxicity - Listing Mechanism                                  

In [None]:
#### Query for instance of Carcinogen, reproductive toxicant, etc
#### If statement exists, check if it needs to be updated (rank changed to deprecated)
#### Else, make no change



In [None]:
## Check presence in Wikidata
object_qid = {'femrep':'Q55427776',
              'menrep': 'Q55427774',
              'devtox': 'Q72941151',
              'cancer': 'Q187661',
              'reptox': 'Q55427767'}

tmplist = []

for eachentry in object_qid.keys():
    sparqlQuery = "SELECT ?item WHERE {?item wdt:P31 wd:"+object_qid[eachentry]+".}"
    result = wdi_core.WDItemEngine.execute_sparql_query(sparqlQuery)
    
    i=0
    while i < len(result["results"]["bindings"]):
        wdid = result["results"]["bindings"][i]["item"]["value"].replace("http://www.wikidata.org/entity/", "")
        tmplist.append({'WDID':wdid,'InstanceType':eachentry})
        i=i+1

instances_in_wd = pd.DataFrame(tmplist)
    
print(len(instances_in_wd))

# Creating and Writing WD statements

## Delisted items

After scanning the data, it appears that this table will allow us to easily identify entities which were 100% delisted, but it will require a bit more logic to identify entities that have been delisted for some conditions but not others.

Some sample comparisons between the Prop 65 list and the OEHHA list:

As seen in the Prop 65 list, BPA was listed as causing female reproductive and developmental toxicity, but its listing as a developmental toxicant was removed on April 19, 2013.
* Bisphenol A (BPA)	female
* Bisphenol A (BPA)  Delisted April 19, 2013 developmental

In contrast its (Bisphenol A) entry in the OEHHA list is illustrated as follows:
* Reproductive Toxicity: Currently listed
* Chemical listed under Proposition 65 as causing: Female Reproductive Toxicity
* Developmental Toxicity - Date of Listing: 4/11/2013
* Developmental Toxicity - Listing Mechanism: AB-NTP-CERHR
* Female Reproductive Toxicity - Date of Listing:	5/11/2015
* Female Reproductive Toxicity - Listing Mechanism: SQE

As seen above, the delisting as a developmental toxicant has to be inferred from its listing as causing only Female Reproductive Toxicity even though it has entries for the date and mechanism of the Developmental Toxicity listing.
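The inference described here can be written as a small per-endpoint check. The sketch below applies it to the Bisphenol A values shown above; the function name `endpoint_status` and the status labels are hypothetical, and the male reproductive fields are assumed to be "None" (they are not shown in the example):

```python
CAUSING_COL = "Chemical listed under Proposition 65 as causing"

def endpoint_status(row, causing_keyword, column_prefix):
    """Classify one endpoint as 'current', 'delisted' or 'never listed'."""
    has_history = (row[column_prefix + " - Date of Listing"] != "None" or
                   row[column_prefix + " - Listing Mechanism"] != "None")
    if causing_keyword in row[CAUSING_COL]:
        return "current"
    ## A listing date or mechanism without a current listing implies delisting
    return "delisted" if has_history else "never listed"

bpa = {
    CAUSING_COL: "Female Reproductive Toxicity",
    "Developmental Toxicity - Date of Listing": "4/11/2013",
    "Developmental Toxicity - Listing Mechanism": "AB-NTP-CERHR",
    "Female Reproductive Toxicity - Date of Listing": "5/11/2015",
    "Female Reproductive Toxicity - Listing Mechanism": "SQE",
    "Male Reproductive Toxicity - Date of Listing": "None",      # assumed
    "Male Reproductive Toxicity - Listing Mechanism": "None",    # assumed
}

print(endpoint_status(bpa, "Development", "Developmental Toxicity"))
print(endpoint_status(bpa, "Female", "Female Reproductive Toxicity"))
print(endpoint_status(bpa, "Male", "Male Reproductive Toxicity"))
```

The keyword match is case-sensitive, so "Male" does not accidentally match inside "Female".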

Completely delisted entries are more straightforward as seen in the case of Allyl Chloride:
This entity is not even listed in the Prop 65 list. In contrast, in the OEHHA list it appears as:
* Cancer: Formerly listed
* Cancer - Listing Mechanism: AB-US EPA 	

Note that the Prop 65 export csv has the date listed for cancer, but the OEHHA export csv does not include the listing and delisting dates for cancer and reproductive toxicity, even though this information may be available on the webpage for the chemical.

In [18]:
### Identify completely delisted items
delisted_df = prop_65_mapped.loc[((prop_65_mapped['Cancer']=="Formerly listed") & (prop_65_mapped['Reproductive Toxicity']=="None"))|
                               ((prop_65_mapped['Cancer']=="None") & (prop_65_mapped['Reproductive Toxicity']=="Formerly listed"))]
delisted_titles = delisted_df['Title'].tolist()
#print(delisted_df.head(n=5))

delisted_in_wd = delisted_df.loc[~delisted_df['WDID'].isnull()]
print(delisted_in_wd.columns)

Index(['Title', 'CAS Number', 'Cancer', 'Cancer - Listing Mechanism',
       'Reproductive Toxicity',
       'Chemical listed under Proposition 65 as causing',
       'Developmental Toxicity - Date of Listing',
       'Developmental Toxicity - Listing Mechanism',
       'Female Reproductive Toxicity - Date of Listing',
       'Female Reproductive Toxicity - Listing Mechanism',
       'Male Reproductive Toxicity - Date of Listing',
       'Male Reproductive Toxicity - Listing Mechanism', 'url_stub', 'WDID',
       'Dev current', 'Male current', 'Female current', 'Cancer current',
       'Rep current', 'Cancer delisted', 'Rep delisted', 'Dev delisted',
       'Female delisted', 'Male delisted'],
      dtype='object')


## CLEAN UP THE CODE

There's a lot of unnecessary redundancy.  Rather than split up the list and then repeating the code over and over again, create a single code with functions that can handle the different conditions.

Create a status flag which can be used to manipulate which functions are called. Eg- Having a Delisted flag will trigger the inclusion of the delisted qualifier statement
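As a rough idea of what such a flag could look like, the two lists returned by `check_listing` and `check_delisting` could be reduced to a single status string. The function name `listing_status` and all labels other than 'Currently Fully Listed' are placeholders:

```python
def listing_status(runlist, delist_list):
    """Derive a single status flag from the listed and delisted endpoint lists."""
    if runlist and not delist_list:
        return "Currently Fully Listed"
    if runlist and delist_list:
        return "Partially Delisted"
    if delist_list:
        return "Fully Delisted"
    return "Not Listed"

print(listing_status(['femrep', 'menrep', 'devtox'], []))
print(listing_status(['femrep'], ['devtox']))
print(listing_status([], ['cancer']))
```

The writing code could then branch on the flag, for example only attaching the delisted qualifier when the status is not 'Currently Fully Listed'.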


In [21]:
object_qid = {'femrep':'Q55427776',
              'menrep': 'Q55427774',
              'devtox': 'Q72941151',
              'cancer': 'Q187661',
              'reptox': 'Q55427767'}

list_date = {'femrep':'Female Reproductive Toxicity - Date of Listing',
             'menrep':'Male Reproductive Toxicity - Date of Listing',
             'devtox':'Developmental Toxicity - Date of Listing',
             'cancer': 'None',
             'reptox': 'None'}

url_base = 'https://oehha.ca.gov/chemicals/'
list_prop = "P31"

In [19]:
def check_listing (chem_info):
    runlist = []
    if chem_info['Cancer']=='Currently listed':
        runlist.append('cancer')
    if chem_info['Reproductive Toxicity']=='Currently listed':
        runlist.append('reptox')
    if chem_info['Female current']==True:
        runlist.append('femrep')     
    if chem_info['Male current']==True:
        runlist.append('menrep')
    if chem_info['Dev current']==True:
        runlist.append('devtox')
    ### Create statement for most-specific reprotox type whenever possible
    ### (only remove 'reptox' if it was added, otherwise list.remove raises ValueError)
    if ('reptox' in runlist) and any(t in runlist for t in ('femrep', 'menrep', 'devtox')):
        runlist.remove('reptox')
    
    return(runlist)


In [20]:
def check_delisting (chem_info):
    delisted = []
    if chem_info['Cancer delisted']==True:
        delisted.append('cancer')
    if chem_info['Rep delisted']==True:
        delisted.append('reptox')
    if chem_info['Dev delisted']==True:
        delisted.append('devtox')
    if chem_info['Female delisted']==True:
        delisted.append('femrep')
    if chem_info['Male delisted']==True:
        delisted.append('menrep')
        
    return (delisted)     

In [22]:
## Compare runlist and check list to determine status
runlist = check_listing(chem_info)
print(runlist)

delist_list = check_delisting(chem_info)
print(delist_list)


['femrep', 'menrep', 'devtox']
[]


In [24]:
print(delisted_in_wd[0:1])

             Title CAS Number           Cancer Cancer - Listing Mechanism  \
20  Allyl Chloride   107-05-1  Formerly listed                  AB-US EPA   

   Reproductive Toxicity Chemical listed under Proposition 65 as causing  \
20                  None                                            None   

   Developmental Toxicity - Date of Listing  \
20                                     None   

   Developmental Toxicity - Listing Mechanism  \
20                                       None   

   Female Reproductive Toxicity - Date of Listing  \
20                                           None   

   Female Reproductive Toxicity - Listing Mechanism      ...        \
20                                             None      ...         

   Dev current Male current Female current Cancer current  Rep current  \
20       False        False          False          False        False   

    Cancer delisted  Rep delisted  Dev delisted  Female delisted  \
20             True         False

In [26]:
listed_edits = []  ## Collect the revision ids of the edits made in this run

for i in tqdm(range(len(delisted_in_wd[0:1]))):
    chem_row = delisted_in_wd.iloc[i]
    runlist = []
    runurl = url_base+chem_row['url_stub']
    prop_65_qid = chem_row['WDID']
    if chem_row['Cancer']=='Formerly listed':
        runlist.append('cancer')
    if chem_row['Reproductive Toxicity']=='Formerly listed':
        runlist.append('reptox')
    statements_to_add = []
    for run_type in runlist:
        run_object_wdid = object_qid[run_type]
        ## 'cancer' and 'reptox' map to 'None' (no date column in the OEHHA export), so fall back to 'None'
        runlist_date = str(chem_row.get(list_date[run_type], 'None'))
        print(runlist_date)
        reference = create_reference(runurl)
        ## P2241 = reason for deprecated rank
        qualifiers = [wdi_core.WDItemID('Q56478729', prop_nr='P2241', is_qualifier=True)]
        if runlist_date!='None':
            ## P580 = start time; use a separate name so the list_date lookup dict is not overwritten
            start_date = wdi_core.WDTime(datetime.strptime(runlist_date,'%m/%d/%Y').strftime("+%Y-%m-%dT00:00:00Z"),
                                         prop_nr='P580', is_qualifier=True)
            qualifiers.append(start_date)
        prop65_statement = wdi_core.WDItemID(value=run_object_wdid, prop_nr=list_prop,
                                             qualifiers=qualifiers,
                                             references=[copy.deepcopy(reference)])
        statements_to_add.append(prop65_statement)
    item = wdi_core.WDItemEngine(wd_item_id=prop_65_qid, data=statements_to_add, append_value=list_prop,
                             global_ref_mode='CUSTOM', ref_handler=update_retrieved_if_new_multiple_refs)
    item.write(login, edit_summary="added data from corresponding CA prop 65 info page.")
    listed_edits.append({'WDID':prop_65_qid,'type':'listed entry','revid':item.lastrevid})

In [None]:


runlist = check_listing(chem_info)
print(runlist)

delist_list = check_delisting(chem_info)
print(delist_list)

## check_delisting always returns a list, so test for an empty list rather than None
if not delist_list:
    status = 'Currently Fully Listed'

    
    
    

In [None]:
    if delisted_in_wd.iloc[i]['Cancer']=='Formerly listed':
        runlist.append('cancer')
    if delisted_in_wd.iloc[i]['Reproductive Toxicity']=='Formerly listed':
        runlist.append('reptox')

In [None]:
part_delisted_dev_df = prop_65_listed.loc[(~prop_65_listed['Title'].isin(part_delisted_titles)) &
                                            (((prop_65_listed['Developmental Toxicity - Date of Listing']!="None")|
                                             (prop_65_listed['Developmental Toxicity - Listing Mechanism']!="None"))&
                                             (prop_65_listed['Dev current']==False))]
part_delisted_fem_df = prop_65_listed.loc[(~prop_65_listed['Title'].isin(part_delisted_titles)) &
                                            (((prop_65_listed['Female Reproductive Toxicity - Date of Listing']!="None")|
                                             (prop_65_listed['Female Reproductive Toxicity - Listing Mechanism']!="None"))&
                                             (prop_65_listed['Female current']==False))]
part_delisted_male_df = prop_65_listed.loc[(~prop_65_listed['Title'].isin(part_delisted_titles)) &
                                            (((prop_65_listed['Male Reproductive Toxicity - Date of Listing']!="None")|
                                             (prop_65_listed['Male Reproductive Toxicity - Listing Mechanism']!="None"))&
                                             (prop_65_listed['Male current']==False))]   

### Writing fully delisted entities as deprecated statements

In [None]:
object_qid = {'femrep':'Q55427776',
              'menrep': 'Q55427774',
              'devtox': 'Q72941151',
              'cancer': 'Q187661',
              'reptox': 'Q55427767'}

list_date = {'femrep':'Female Reproductive Toxicity - Date of Listing',
             'menrep':'Male Reproductive Toxicity - Date of Listing',
             'devtox':'Developmental Toxicity - Date of Listing',
             'cancer': 'None',
             'reptox': 'None'}

url_base = 'https://oehha.ca.gov/chemicals/'
list_prop = "P31"

In [None]:
delisted_edits = []

for i in tqdm(range(len(delisted_in_wd[0:1]))):
    runlist = []
    runurl = url_base+delisted_in_wd.iloc[i]['url_stub']
    prop_65_qid = delisted_in_wd.iloc[i]['WDID']
    if delisted_in_wd.iloc[i]['Cancer']=='Formerly listed':
        runlist.append('cancer')
    if delisted_in_wd.iloc[i]['Reproductive Toxicity']=='Formerly listed':
        runlist.append('reptox')
    statements_to_add = []
    for run_type in runlist:
        run_object_wdid = object_qid[run_type]
        runlist_date = str(delisted_in_wd.iloc[i][list_date[run_type]])
        reference = create_reference(runurl,runlist_date)
        delist_reason = wdi_core.WDItemID('Q56478729', prop_nr='P2241', is_qualifier=True)
        ## Delisted entries are written with deprecated rank
        prop65_statement = wdi_core.WDItemID(value=run_object_wdid, prop_nr=list_prop, rank='deprecated',
                                             qualifiers = [delist_reason],
                                             references=[copy.deepcopy(reference)])
        statements_to_add.append(prop65_statement)
    item = wdi_core.WDItemEngine(wd_item_id=prop_65_qid, data=statements_to_add, append_value=list_prop,
                             global_ref_mode='CUSTOM', ref_handler=update_retrieved_if_new_multiple_refs)
    item.write(login, edit_summary="added data from corresponding CA prop 65 info page.")
    delisted_edits.append({'WDID':prop_65_qid,'type':'delisted entry','revid':item.lastrevid})

In [None]:
"""
#### The above was created to avoid repeating the code for just a variable here and there


#### Create and write statements

delisted_edits = []

##Instance of Carcinogen, rank is deprecated
carcinogen_delisted = delisted_in_wd.loc[delisted_in_wd['Cancer']=='Formerly listed']
carcinogen_qid = 'Q187661'
for i in tqdm(range(len(carcinogen_delisted))):
    prop_65_qid = carcinogen_delisted.iloc[i]['WDID']
    prop_65_id = carcinogen_delisted.iloc[i]['url_stub']
    prop_65_url = url_base+prop_65_id
    reference = create_reference(prop_65_url)
    delist_reason = wdi_core.WDItemID('Q56478729', prop_nr='P2241', is_qualifier=True)
    prop65_statement = [wdi_core.WDItemID(value=carcinogen_qid, prop_nr=list_prop, rank='deprecated',
                                qualifiers = [delist_reason], references=[copy.deepcopy(reference)])]
    item = wdi_core.WDItemEngine(wd_item_id=prop_65_qid, data=prop65_statement, append_value=list_prop,
                             global_ref_mode='CUSTOM', ref_handler=update_retrieved_if_new_multiple_refs)
    item.write(login, edit_summary="added data from corresponding CA prop 65 info page.")
    delisted_edits.append({'WDID':prop_65_qid,'type':'delisted entry','revid':item.lastrevid})
    
##Instance of Reproductive toxin, rank is deprecated
reprotox_delisted = delisted_in_wd.loc[delisted_in_wd['Reproductive Toxicity']=='Formerly listed']
reprotox_qid = 'Q55427767'
for i in tqdm(range(len(reprotox_delisted))):
    prop_65_qid = reprotox_delisted.iloc[i]['WDID']
    prop_65_id = reprotox_delisted.iloc[i]['url_stub']
    prop_65_url = url_base+prop_65_id
    reference = create_reference(prop_65_url)
    delist_reason = wdi_core.WDItemID('Q56478729', prop_nr='P2241', is_qualifier=True)
    prop65_statement = [wdi_core.WDItemID(value=reprotox_qid, prop_nr=list_prop, rank='deprecated',
                                qualifiers = [delist_reason], references=[copy.deepcopy(reference)])]
    item = wdi_core.WDItemEngine(wd_item_id=prop_65_qid, data=prop65_statement, append_value=list_prop,
                             global_ref_mode='CUSTOM', ref_handler=update_retrieved_if_new_multiple_refs)
    previous_revision = item.lastrevid
    item.write(login, edit_summary="added data from corresponding CA prop 65 info page.")
    delisted_edits.append({'WDID':prop_65_qid,'type':'delisted entry','revid':item.lastrevid})
"""

## Items under consideration or considered, but not listed
We can pull these, but it's not clear how they should be included in Wikidata

In [None]:
### Identify items that were considered, but not listed
considered_df = prop_65_mapped.loc[((prop_65_mapped['Cancer']=="Considered, but not listed") & (prop_65_mapped['Reproductive Toxicity']=="None"))|
                               ((prop_65_mapped['Cancer']=="None") & (prop_65_mapped['Reproductive Toxicity']=="Considered, but not listed"))]
considered_titles = considered_df['Title'].tolist()
#print(considered_df.head(n=5))

### Identify items that are under consideration
considering_df = prop_65_mapped.loc[((prop_65_mapped['Cancer']=="Under consideration") & (prop_65_mapped['Reproductive Toxicity']=="None"))|
                               ((prop_65_mapped['Cancer']=="None") & (prop_65_mapped['Reproductive Toxicity']=="Under consideration"))]
considering_titles = considering_df['Title'].tolist()
#print(considering_df.head(n=5))
print("Considered, not listed: ",len(considered_df),"Under consideration: ", len(considering_df))


## Partially delisted items

We can filter for these by removing items that were completely delisted, along with items that were considered but not listed or are still under consideration. Next, we count the entries under "Chemical listed under Proposition 65 as causing" and check whether more toxicity columns are populated than there are entries in that field. A populated column with no matching entry points to an endpoint that was delisted.

Partially delisted items also include those that are delisted under either Cancer or Reproductive Toxicity but are NOT empty for the other.
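The two checks described above can be sketched on a single record using plain Python. This is only an illustration, not the notebook's pandas implementation: the helper names `classify_listing` and `lapsed_endpoints`, the `ENDPOINTS` table, and the sample row are all hypothetical, and the sketch assumes that missing values appear as the string "None", as in the filters used in this notebook.

```python
# Hypothetical helpers illustrating the partial-delisting checks on one record.
# Assumption: missing values are stored as the string "None".

# Column prefix -> keyword expected in "Chemical listed under Proposition 65 as causing"
ENDPOINTS = {
    "Developmental Toxicity": "Development",
    "Female Reproductive Toxicity": "Female",
    "Male Reproductive Toxicity": "Male",
}


def classify_listing(cancer, reptox):
    """Classify a chemical from its 'Cancer' and 'Reproductive Toxicity' status strings."""
    statuses = {cancer, reptox} - {"None"}
    if not statuses:
        return "no status"
    if statuses == {"Formerly listed"}:
        return "fully delisted"
    if statuses == {"Formerly listed", "Currently listed"}:
        return "partially delisted"
    if statuses == {"Currently listed"}:
        return "currently listed"
    return "other"  # e.g. under consideration, or considered but not listed


def lapsed_endpoints(row):
    """Return endpoints that have listing details but are absent from the 'listed as causing' field."""
    causing = row.get("Chemical listed under Proposition 65 as causing", "")
    lapsed = []
    for endpoint, keyword in ENDPOINTS.items():
        has_details = (row.get(endpoint + " - Date of Listing", "None") != "None"
                       or row.get(endpoint + " - Listing Mechanism", "None") != "None")
        if has_details and keyword not in causing:
            lapsed.append(endpoint)
    return lapsed


# Made-up record, for illustration only
sample_row = {
    "Chemical listed under Proposition 65 as causing": "Cancer, Male Reproductive Toxicity",
    "Male Reproductive Toxicity - Date of Listing": "01/01/2000",
    "Developmental Toxicity - Date of Listing": "01/01/2000",
}
print(classify_listing("Formerly listed", "Currently listed"))
print(lapsed_endpoints(sample_row))
```

The keyword match is case-sensitive, so "Male" does not match inside "Female", which mirrors the `str.contains` calls used for the `Male current` and `Female current` columns.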

In [None]:
### Remove entries which were completely delisted, are under consideration, or considered, and not listed
prop_65_listed = prop_65_mapped.loc[~prop_65_mapped['Title'].isin(delisted_titles+considered_titles+considering_titles)].copy()
print("Items currently listed under prop 65: ", len(prop_65_listed))

### Identify cancer vs reproductive partially delisted items
part_delisted_df = prop_65_listed.loc[((prop_65_listed['Cancer']=="Formerly listed") & (prop_65_listed['Reproductive Toxicity']=="Currently listed"))|
                               ((prop_65_listed['Cancer']=="Currently listed") & (prop_65_listed['Reproductive Toxicity']=="Formerly listed"))]
part_delisted_titles = part_delisted_df['Title'].tolist()
#print(part_delisted_df.head(n=5))
print("Items partially delisted for cancer or reproductive toxicity: ", len(part_delisted_df))

In [None]:
### Identify items that were partially delisted for one type of reproductive toxicity or another
prop_65_listed['Dev current'] = prop_65_listed['Chemical listed under Proposition 65 as causing'].str.contains("Development")
prop_65_listed['Male current'] = prop_65_listed['Chemical listed under Proposition 65 as causing'].str.contains("Male")
prop_65_listed['Female current'] = prop_65_listed['Chemical listed under Proposition 65 as causing'].str.contains("Female")

#print(prop_65_listed.head(n=2))
### These can be identified as items which are not none for date of list/listing mechanism 
### for a particular type of toxicity, but is listed as "False" for the corresponding toxicity
part_delisted_dev_df = prop_65_listed.loc[(~prop_65_listed['Title'].isin(part_delisted_titles)) &
                                            (((prop_65_listed['Developmental Toxicity - Date of Listing']!="None")|
                                             (prop_65_listed['Developmental Toxicity - Listing Mechanism']!="None"))&
                                             (prop_65_listed['Dev current']==False))]
part_delisted_fem_df = prop_65_listed.loc[(~prop_65_listed['Title'].isin(part_delisted_titles)) &
                                            (((prop_65_listed['Female Reproductive Toxicity - Date of Listing']!="None")|
                                             (prop_65_listed['Female Reproductive Toxicity - Listing Mechanism']!="None"))&
                                             (prop_65_listed['Female current']==False))]
part_delisted_male_df = prop_65_listed.loc[(~prop_65_listed['Title'].isin(part_delisted_titles)) &
                                            (((prop_65_listed['Male Reproductive Toxicity - Date of Listing']!="None")|
                                             (prop_65_listed['Male Reproductive Toxicity - Listing Mechanism']!="None"))&
                                             (prop_65_listed['Male current']==False))]   
#print(part_delisted_dev_df.head(n=2))

In [None]:
### Identify items that have not been delisted at all
part_delisted_dev_titles = part_delisted_dev_df['Title'].tolist()
part_delisted_fem_titles = part_delisted_fem_df['Title'].tolist()
part_delisted_male_titles = part_delisted_male_df['Title'].tolist()
not_delisted = prop_65_listed.loc[(~prop_65_listed['Title'].isin(part_delisted_titles))&
                                  (~prop_65_listed['Title'].isin(part_delisted_dev_titles))&
                                  (~prop_65_listed['Title'].isin(part_delisted_fem_titles))&
                                  (~prop_65_listed['Title'].isin(part_delisted_male_titles))]
print(len(not_delisted))

# Current Listings
The initial run should write both the listed and delisted entities. The tables should be stored so that future exports can be compared with prior runs, minimizing the number of writes needed to keep the data up to date. The normalization of entities will depend on assignments by Mix N Match.

How to handle the listing and delisting dates via the references:

* Chemical causes cancer --> Instance (P31) of carcinogen
* Chemical causes developmental toxicity --> Instance (P31) of developmental toxicant
* Chemical causes reproductive toxicity --> Instance (P31) of reproductive toxicant
* Statement date --> retrieved (P813) : access date 
* Date listed --> start time (P580) : from date 
* Date delisted --> end time (P582) : end date 
* Delisted --> reason for deprecation (P2241) in conjunction with disqualification (Q1229261)
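The date handling in the mapping above can be sketched without touching Wikidata. The snippet below is a minimal illustration that assumes OEHHA dates arrive as month/day/year strings and that missing dates are the string "None"; `to_wikidata_time` and `plan_qualifiers` are hypothetical helper names, and the (property, value) pairs they return would still need to be wrapped in `wdi_core.WDTime` objects before writing.

```python
from datetime import datetime


def to_wikidata_time(date_string, fmt="%m/%d/%Y"):
    """Convert a date such as '04/08/2016' into the '+YYYY-MM-DDT00:00:00Z' form used for time values."""
    return datetime.strptime(date_string, fmt).strftime("+%Y-%m-%dT00:00:00Z")


def plan_qualifiers(list_date=None, delist_date=None):
    """Return (property, value) pairs for the date qualifiers of one statement.

    Assumption: a missing date is either None or the string "None".
    """
    qualifiers = []
    if list_date and list_date != "None":
        qualifiers.append(("P580", to_wikidata_time(list_date)))    # start time
    if delist_date and delist_date != "None":
        qualifiers.append(("P582", to_wikidata_time(delist_date)))  # end time
    return qualifiers


print(plan_qualifiers("04/08/2016"))
print(plan_qualifiers("4/11/2013", "4/19/2013"))
```

Keeping the conversion in one place makes it easier to apply the same format to start time (P580) and end time (P582) qualifiers.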

### Initial run for current listings

In [None]:
listed_edits = []

for i in tqdm(range(len(not_delisted[0:1]))):
    runlist = []
    runurl = url_base+not_delisted.iloc[i]['url_stub']
    prop_65_qid = not_delisted.iloc[i]['WDID']
    if not_delisted.iloc[i]['Cancer']=='Currently listed':
        runlist.append('cancer')
    if not_delisted.iloc[i]['Female current']==True:
        runlist.append('femrep')
    if not_delisted.iloc[i]['Male current']==True:
        runlist.append('menrep')
    if not_delisted.iloc[i]['Dev current']==True:
        runlist.append('devtox')
    statements_to_add = []
    for j in range(len(runlist)):
        run_type = runlist[j]
        run_object_wdid = object_qid[run_type]
        runlist_date = str(not_delisted.iloc[i][list_date[run_type]])
        reference = create_reference(runurl,runlist_date) 
        prop65_statement = wdi_core.WDItemID(value=run_object_wdid, prop_nr=list_prop, 
                                              references=[copy.deepcopy(reference)])
        statements_to_add.append(prop65_statement)
    item = wdi_core.WDItemEngine(wd_item_id=prop_65_qid, data=statements_to_add, append_value=list_prop,
                             global_ref_mode='CUSTOM', ref_handler=update_retrieved_if_new_multiple_refs)
    item.write(login, edit_summary="added data from corresponding CA prop 65 info page.")
    listed_edits.append({'WDID':prop_65_qid,'type':'listed entry','revid':item.lastrevid})

##Successful run: https://www.wikidata.org/wiki/Q27888393

### Initial run for partially delisted items

### Export results for future investigations

# Scheduled Runs
The maintenance runs should parse the data similar to the previous runs and compare the results to look for new entries to add and new delistings to deprecate.
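One possible shape for that comparison is sketched below. It assumes each run can be reduced to a mapping from a chemical identifier (for example the `url_stub`) to the set of endpoints it is currently listed for; the function name `diff_snapshots` and the sample identifiers are hypothetical and not part of the bot.

```python
def diff_snapshots(previous, current):
    """Compare two runs and return (to_add, to_deprecate).

    Both arguments map a chemical identifier to the set of endpoints
    (e.g. 'cancer', 'devtox') that the chemical is currently listed for.
    """
    to_add, to_deprecate = {}, {}
    for chem in previous.keys() | current.keys():
        before = previous.get(chem, set())
        after = current.get(chem, set())
        if after - before:
            to_add[chem] = after - before        # new listings to write
        if before - after:
            to_deprecate[chem] = before - after  # listings that disappeared
    return to_add, to_deprecate


# Made-up snapshots, for illustration only
previous_run = {"chem-a": {"cancer"}, "chem-b": {"cancer", "devtox"}}
current_run = {"chem-a": {"cancer", "femrep"}, "chem-b": {"cancer"}, "chem-c": {"menrep"}}

new_listings, new_delistings = diff_snapshots(previous_run, current_run)
print(new_listings)
print(new_delistings)
```

Only the chemicals returned in the two dictionaries would need a write, which keeps the number of edits per scheduled run small.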

In [None]:
## Unit test


carcinogen_qid = 'Q187661'
devtox_qid = 'Q72941151'
femreptox_qid = 'Q55427776'
malereptox_qid = 'Q55427774'

prop_65_url = 'https://oehha.ca.gov/proposition-65/chemicals/abiraterone-acetate'
prop_65_id = 'abiraterone-acetate'
prop_65_qid = 'Q4115189' #'Q27888393'
reference = create_reference(prop_65_url)
list_prop = "P31" 
start_date = '04/08/2016'
delist_date = '4/19/2013'

list_qualifier = wdi_core.WDTime(datetime.strptime(start_date,'%m/%d/%Y').strftime("+%Y-%m-%dT00:00:00Z"), prop_nr='P580', is_qualifier=True)
dev_statement = [wdi_core.WDItemID(value=devtox_qid, prop_nr=list_prop, 
                               qualifiers=[list_qualifier],
                               references=[copy.deepcopy(reference)])]

item = wdi_core.WDItemEngine(wd_item_id=prop_65_qid, data=dev_statement, append_value=list_prop,
                           global_ref_mode='CUSTOM', ref_handler=update_retrieved_if_new_multiple_refs)




In [None]:
"""
## Unit test --  write a statement
disease_qid = 'Q4115189' #'Q2703116'
ghr_url = 'https://ghr.nlm.nih.gov/condition/15q11-q13-duplication-syndrome'
ghr_id = '15q11-q13-duplication-syndrome'
reference = create_reference(ghr_url)
url_prop = "P7464" 
start_date = '4/11/2013'
delist_date = '4/19/2013'


list_qualifier = wdi_core.WDTime(datetime.strptime(start_date,'%m/%d/%Y').strftime("+%Y-%m-%dT00:00:00Z"), prop_nr='P580', is_qualifier=True)
delist_qualifier = wdi_core.WDTime(datetime.strptime(delist_date,'%m/%d/%Y').strftime("+%Y-%m-%dT00:00:00Z"), prop_nr='P582', is_qualifier=True)
delist_reason = wdi_core.WDItemID('Q56478729', prop_nr='P2241', is_qualifier=True)

statement = [wdi_core.WDString(value=ghr_id, prop_nr=url_prop, rank='deprecated', 
                               qualifiers=[list_qualifier,delist_qualifier,delist_reason],
                               references=[copy.deepcopy(reference)])]
item = wdi_core.WDItemEngine(wd_item_id=disease_qid, data=statement, append_value=url_prop,
                           global_ref_mode='CUSTOM', ref_handler=update_retrieved_if_new_multiple_refs)
item.write(login)
print(ghr_id, disease_qid, ghr_url)
  


"""

In [None]:
### How deprecations are handled in Gene Bot (which uses WDI):
"""
def remove_deprecated_statements(qid, frc, releases, last_updated, props, login):

#    :param qid: qid of item
#    :param frc: a fastrun container
#    :param releases: list of releases to remove (a statement that has a reference that is stated in one of these
#            releases will be removed)
#    :param last_updated: looks like {'Q20641742': datetime.date(2017,5,6)}. a statement that has a reference that is
#            stated in Q20641742 (entrez) and was retrieved more than DAYS before 2017-5-6 will be removed
#    :param props: look at these props
#    :param login:
#    :return:
    for prop in props:
        frc.write_required([wdi_core.WDString("fake value", prop)])
    orig_statements = frc.reconstruct_statements(qid)
    releases = set(int(r[1:]) for r in releases)

    s_dep = []
    for s in orig_statements:
        if any(any(x.get_prop_nr() == 'P248' and x.get_value() in releases for x in r) for r in s.get_references()):
            setattr(s, 'remove', '')
            s_dep.append(s)
        else:
            for r in s.get_references():
                dbs = [x.get_value() for x in r if x.get_value() in last_updated]
                if dbs:
                    db = dbs[0]
                    if any(x.get_prop_nr() == 'P813' and last_updated[db] - x.get_value() > DAYS for x in r):
                        setattr(s, 'remove', '')
                        s_dep.append(s)
    if s_dep:
        print("-----")
        print(qid)
        print(len(s_dep))
        print([(x.get_prop_nr(), x.value) for x in s_dep])
        print([(x.get_references()[0]) for x in s_dep])
        wd_item = wdi_core.WDItemEngine(wd_item_id=qid, data=s_dep, fast_run=False)
        wdi_helpers.try_write(wd_item, '', '', login, edit_summary="remove deprecated statements")
"""

### How to get rank using WDI
"""
item = wdi_core.WDItemEngine(wd_item_id=qid)
new_ss = []
for s in item.statements:  # type: wdi_core.WDBaseDataType
    if s.get_rank() != "normal":
        continue
"""


### How to handle deprecations using pywikibot
"""
https://doc.wikimedia.org/pywikibot/master/_modules/pywikibot/page.html#Claim.changeRank
"""

### WDI rank handling
"""
type rank: A string of one of three allowed values: 'normal', 'deprecated', 'preferred'


    def get_rank(self):
        if self.is_qualifier or self.is_reference:
            return ''
        else:
            return self.rank

    def set_rank(self, rank):
        if self.is_qualifier or self.is_reference:
            raise ValueError('References or qualifiers do not have ranks')

        valid_ranks = ['normal', 'deprecated', 'preferred']

        if rank not in valid_ranks:
            raise ValueError('{} not a valid rank'.format(rank))

        self.rank = rank

"""

In [None]:
def run_one(taxid, genbank_id):
    # get the QID
    taxid = str(taxid)
    if taxid not in tax_qid_map:
        msg = wdi_helpers.format_msg(genbank_id, PROPS['GenBank Assembly accession'], "",
                               "organism with taxid {} not found or skipped".format(taxid))
        wdi_core.WDItemEngine.log("WARNING", msg)
        return None
    qid = tax_qid_map[taxid]
    reference = create_reference(genbank_id)
    genbank_statement = wdi_core.WDExternalID(genbank_id, PROPS['GenBank Assembly accession'], references=[reference])
    
    # create the item object, specifying the qid
    item = wdi_core.WDItemEngine(data=[genbank_statement], wd_item_id=qid, fast_run=True, 
                                 fast_run_base_filter={PROPS['GenBank Assembly accession']: ''})

    wdi_helpers.try_write(item, record_id=genbank_id, record_prop=PROPS['GenBank Assembly accession'],
                          login=login, edit_summary="update GenBank Assembly accession")
    

def run_one(taxid, genbank_id):
    # create a statement for the ncbi tax id
    ncbi_statement = wdi_core.WDExternalID(str(taxid), PROPS['NCBI Taxonomy ID'])
    # we are going to retrieve the item to be modified based on the NCBI Taxonomy ID, which should already exist on all organisms.
    try:
        item = wdi_core.WDItemEngine(data=[ncbi_statement], domain="organism", search_only=True, item_name="organism")
    except wdi_core.ManualInterventionReqException as e:
        # if there are more than one items with this ncbi tax id, this will throw an error!
        # instead, catch it and log the error
        msg = wdi_helpers.format_msg(genbank_id, PROPS['GenBank Assembly accession'], "", str(e), type(e))
        wdi_core.WDItemEngine.log("ERROR", msg)
        return
    
    if item.wd_item_id:
        # if the item exists, create the genbank statement
        reference = create_reference(genbank_id)
        genbank_statement = wdi_core.WDExternalID(genbank_id, PROPS['GenBank Assembly accession'], references=[reference])
        # create the item object, specifying the qid
        item = wdi_core.WDItemEngine(data=[genbank_statement], wd_item_id=item.wd_item_id)
        # use this helper method to perform the write. It automatically writes to a log file and captures errors
        # wdi also has an automatic backoff and retry functionality
        wdi_helpers.try_write(item, record_id=genbank_id, record_prop=PROPS['GenBank Assembly accession'],
                              login=login, edit_summary="update GenBank Assembly accession")
    else:
        # if the item doesn't exist, log it and skip
        msg = wdi_helpers.format_msg(genbank_id, PROPS['GenBank Assembly accession'], "",
                               "No organism found with taxid {}".format(taxid))
        wdi_core.WDItemEngine.log("WARNING", msg)