# CA Prop 65 Chemical categorization
California Proposition 65 requires labeling for chemicals that the state of California recognizes as causing cancer or reproductive/developmental toxicity. This lets us categorize the chemicals on the Prop 65 list as instances of 'carcinogen', 'reproductive toxicant', 'male reproductive toxicant', 'female reproductive toxicant', and 'developmental toxicant' in Wikidata, adding some very basic (and somewhat indirect) chemical-disease relationships.

Note that California's OEHHA allows users to download a pre-exported .csv file of chemicals listed under Prop 65 (does not include de-listed chemicals). Alternatively, users can export the complete list of chemicals from OEHHA which will include chemicals that are under consideration, currently listed, or formerly listed.

This notebook partially explores both exports to find the best way to load the data into Wikidata. The final bot will likely NOT include both methods.

In [2]:
from wikidataintegrator import wdi_core, wdi_login, wdi_helpers
from wikidataintegrator.ref_handlers import update_retrieved_if_new_multiple_refs
import pandas as pd
from pandas import read_csv
import requests
from tqdm.notebook import trange, tqdm
import ipywidgets 
import widgetsnbextension
import xml.etree.ElementTree as et 
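import time
import json  ## needed for json.dumps in the logging setup below
import os    ## needed for os.environ in the scheduled-bot login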


In [2]:
from datetime import datetime
import copy
def create_reference(prop65_url):
    refStatedIn = wdi_core.WDItemID(value="Q28455381", prop_nr="P248", is_reference=True)
    timeStringNow = datetime.now().strftime("+%Y-%m-%dT00:00:00Z")
    refRetrieved = wdi_core.WDTime(timeStringNow, prop_nr="P813", is_reference=True)
    refURL = wdi_core.WDUrl(value=prop65_url, prop_nr="P854", is_reference=True)

    return [refStatedIn, refRetrieved, refURL]

In [None]:
"""
## Login for Scheduled bot
print("Logging in...")
try:
    from scheduled_bots.local import WDUSER, WDPASS
except ImportError:
    if "WDUSER" in os.environ and "WDPASS" in os.environ:
        WDUSER = os.environ['WDUSER']
        WDPASS = os.environ['WDPASS']
    else:
        raise ValueError("WDUSER and WDPASS must be specified in local.py or as environment variables")
"""

In [None]:
print("Logging in...")
import wdi_user_config ## Credentials stored in a wdi_user_config file
login_dict = wdi_user_config.get_credentials()
login = wdi_login.WDLogin(login_dict['WDUSER'], login_dict['WDPASS'])


In [None]:
## Set up logging

wdi_core.WDItemEngine.setup_logging(header=json.dumps({'name': 'prop_65', 
                                                       'timestamp': str(datetime.now()), 
                                                       'run_id': str(datetime.now())}))

# Get the most recent csv file and parse it

## Prop 65 list method
The updates appear to be posted on this site: https://oehha.ca.gov/proposition-65/proposition-65-list
Since the name of the updated data file changes with the update date, it would be better to scrape the newest file name from this page than to hard-code it. The update date could also be scraped from this page and compared with the last run to determine whether changes were made and an update is needed.

Unfortunately, this page uses iframes and is protected from robots using captchas (Incapsula), so the csv should be manually downloaded to the data folder and named by the update date (YYYY-MM-DD.csv) prior to running the bot.
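Because the manually downloaded files are named by update date, the newest one can be selected by parsing the file names instead of hard-coding `datasrc`. The following is a minimal sketch that assumes the `YYYY-MM-DD.csv` convention described above; the helper name `latest_dated_csv` is hypothetical and not part of the bot.

```python
from datetime import datetime


def latest_dated_csv(filenames):
    """Return (filename, date) for the newest 'YYYY-MM-DD.csv' name, or None."""
    dated = []
    for name in filenames:
        if not name.endswith(".csv"):
            continue
        try:
            # Names that are not a plain date (e.g. 'OEHHA-2019-11-1.csv') are skipped
            stamp = datetime.strptime(name[:-len(".csv")], "%Y-%m-%d").date()
        except ValueError:
            continue
        dated.append((stamp, name))
    if not dated:
        return None
    stamp, name = max(dated)
    return name, stamp


# Hypothetical folder listing; in the notebook this could come from os.listdir('data')
print(latest_dated_csv(["2019-06-28.csv", "2019-09-13.csv", "OEHHA-2019-11-1.csv"]))
```

The returned date can also be compared with the date of the previous run to decide whether an update is needed at all.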

In [3]:
datasrc = 'data/2019-09-13.csv'

header_junk = 11 ## Note, the number of rows to skip may change
tail_junk = 6

chem_list = read_csv(datasrc, skiprows=header_junk, encoding = 'unicode_escape') 
chem_list.dropna(axis='columns', how='all',inplace=True)
chem_list.fillna("None", inplace=True)
chem_list.drop_duplicates(keep='first',inplace=True)
## Filter out blank entries
chem_list_clean = chem_list.loc[(chem_list['Chemical']!="None")].copy()
chem_list_clean.drop(chem_list_clean.tail(tail_junk).index,inplace=True)
print(chem_list_clean.tail(n=3))

                                               Chemical  \
1003                                   Zidovudine (AZT)   
1004                                          Zileuton    
1005  Zineb  Delisted October 29, 1999 [Click here f...   

                   Type of Toxicity Listing Mechanism      CAS No.  \
1003                        cancer                 LC   30516-87-1   
1004  cancer, developmental, female                FR  111406-87-2   
1005                         cancer                AB   12122-67-7   

     Date Listed NSRL or MADL (µg/day)a  
1003   18-Dec-09                   None  
1004   22-Dec-00                   None  
1005    1-Jan-90                   None  


Note that the csv table has sub-entries which appear as separate entries, but which should retain the header entry's data for every field except the last one. These will need to be accounted for. As of the csv file dated 2019-09-13, these entries are preceded by some spaces. Another way to filter for them is to pull entries which have "None" for everything but 'Chemical' and 'NSRL or MADL'.

In [4]:
sub_entries = chem_list_clean.loc[(chem_list_clean['Chemical'].str.contains("  ")) &
                                  (chem_list_clean['Type of Toxicity']=="None") &
                                  (chem_list_clean['Listing Mechanism']=="None")]
print(sub_entries.head(n=3))
print(len(sub_entries))

                Chemical Type of Toxicity Listing Mechanism CAS No.  \
97            Beryllium              None              None    None   
98       Beryllium oxide             None              None    None   
99     Beryllium sulfate             None              None    None   

   Date Listed NSRL or MADL (µg/day)a  
97        None                    0.1  
98        None                    0.1  
99        None                 0.0002  
22


The toxicity information for these sub-entries is found in the header entry. To find it, obtain the index values of the sub-entries. If the sub-entries are sequential, the header entry should sit at one less than the smallest index value. The header values can then be copied over, with the exception of the CAS number.
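The copy-over step can be sketched without pandas on a list of row dictionaries. This is only an illustration under stated assumptions: the function name `inherit_header_fields` is hypothetical, the rows are made up, and a sub-entry is detected with the same test used in the cell above (a double space in the name plus "None" for toxicity and listing mechanism).

```python
def inherit_header_fields(rows, fields=("Type of Toxicity", "Listing Mechanism", "Date Listed")):
    """Copy the given fields from each header entry into the sub-entries below it.

    'CAS No.' is deliberately left out of `fields`, so it is never copied.
    """
    filled = []
    header = None
    for row in rows:
        row = dict(row)  # work on a copy so the input rows stay untouched
        is_sub = ("  " in row["Chemical"]
                  and row["Type of Toxicity"] == "None"
                  and row["Listing Mechanism"] == "None")
        if is_sub and header is not None:
            for field in fields:
                row[field] = header[field]
        elif not is_sub:
            header = row  # remember the most recent header entry
        filled.append(row)
    return filled


# Made-up rows that only mimic the layout of the Prop 65 csv
rows = [
    {"Chemical": "Example compounds", "Type of Toxicity": "cancer",
     "Listing Mechanism": "AB", "CAS No.": "None", "Date Listed": "1-Jan-90"},
    {"Chemical": "  Example oxide", "Type of Toxicity": "None",
     "Listing Mechanism": "None", "CAS No.": "None", "Date Listed": "None"},
]
for row in inherit_header_fields(rows):
    print(row)
```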

Alternatively, entries without CAS numbers can be assumed to have sub-entries and can be ignored, but doing so would skip entries that have Prop 65 pages but no CAS number. For example, Alcoholic Beverages has no CAS No. but will likely be matched appropriately via Mix N Match. We should therefore be able to cover cases like this even without a CAS number.

That said, aggregate entries only have 1 prop 65 page and multiple Wikidata pages, which would result in one to many mappings. Since it's unclear how these will be handled by the Mix N match community, the first pass should ignore the aggregate entries (which can be identified as mentioned above)

### Chemical names to URL conversion
The property in Wikidata uses the URL stub as its ID, so we'll need to convert the chemical names to URL stubs that work with the Prop 65 website. The URLs will then be mapped to the Wikidata entries that received the property via Mix N Match. Normally the URLs could be tested, but the CA Prop 65 website has captcha protection and blocks scrapers.

Also, note that de-listed items still have URLs, but those URLs are no longer readily available (i.e. the URLs work, but are not necessarily linked from the list).

Example conversion:
* "A-alpha-C (2-Amino-9H-pyrido[2,3-b]indole)" --> "alpha-c-2-amino-9h-pyrido23-bindole"
* "Altretamine" --> "altretamine"
* "Allyl chloride  Delisted October 29, 1999 [Click here for the basis for delisting]" --> "allyl-chloride"
* "p-Aminoazobenzene" --> "p-aminoazobenzene"
* "4-Aminobiphenyl (4-aminodiphenyl)" --> "4-aminobiphenyl-4-aminodiphenyl"
* "2-Amino-5-(5-nitro-2-furyl)-1,3,4-thiadiazole" --> "2-amino-5-5-nitro-2-furyl-134-thiadiazole"
* "?-Methyl styrene (alpha-Methylstyrene)" --> "methyl-styrene-alpha-methylstyrene"
* "N,N'-Diacetylbenzidine" --> "nn-diacetylbenzidine"
* "MeIQx (2-Amino-3,8-dimethylimidazo[4,5-f]quinoxaline)" --> "meiqx-2-amino-38-dimethylimidazo45-fquinoxaline"
* "2?Mercaptobenzothiazole" --> "2-mercaptobenzothiazole"
* "Trp-P-1 (Tryptophan-P-1)" --> "trp-p-1-tryptophan-p-1"
* "Alcoholic beverages, when associated with alcohol abuse" --> "alcoholic-beverages"
* "Aspirin (NOTE:  It is especially  important not to use aspirin during the last three months of pregnancy,  unless specifically directed to do so by a physician because it may cause  problems in the unborn child or  complications during delivery.)" --> "aspirin"

Clean up miscellaneous notes included in the chemical names, such as delistings, delisting dates, and changes in listings. Delistings should be recorded as deprecations of the corresponding statements.
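Most of the conversions listed above can be approximated with a few regular expressions. The sketch below is a heuristic, not the site's actual slug rule: the function name `name_to_url_stub` is hypothetical, and it does not reproduce every example (it keeps the leading "A-" of "A-alpha-C", drops the "?" in "2?Mercaptobenzothiazole" without inserting a dash, and keeps trailing phrases such as "when associated with alcohol abuse").

```python
import re


def name_to_url_stub(name):
    """Approximate the Prop 65 URL stub for a chemical name (heuristic sketch)."""
    # Drop trailing delisting notes such as "  Delisted October 29, 1999 [Click here ...]"
    name = re.split(r"\s+Delisted\b", name)[0]
    # Drop embedded "(NOTE: ...)" remarks
    name = re.sub(r"\(NOTE:.*\)", "", name)
    stub = name.lower()
    # Remove everything except letters, digits, whitespace, and hyphens
    stub = re.sub(r"[^a-z0-9\s-]", "", stub)
    # Whitespace runs become single hyphens; collapse repeats and trim the ends
    stub = re.sub(r"\s+", "-", stub.strip())
    stub = re.sub(r"-{2,}", "-", stub)
    return stub.strip("-")


for example in ["Altretamine",
                "N,N'-Diacetylbenzidine",
                "Allyl chloride  Delisted October 29, 1999 [Click here for the basis for delisting]"]:
    print(example, "-->", name_to_url_stub(example))
```

Names that the heuristic cannot handle would still need a manual lookup table or a check against the website.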

Because of all these issues, we'll work with the OEHHA list rather than the Prop 65 list

## CA OEHHA clean up method
The manually triggered export of the chemical list from the OEHHA site has less header junk and fewer stray title notes and other items that disrupt the structure, which should make the name conversion easier. Additionally, the data on cancer, reproductive toxicity, etc. is more structured and doesn't have random blank spaces.

In [5]:
datasrc = 'data/OEHHA-2019-11-1.csv'

chem_list = read_csv(datasrc, encoding = 'unicode_escape', header=0) 
chem_list.dropna(axis='columns', how='all',inplace=True)
chem_list.fillna("None", inplace=True)
#print(chem_list.columns.values)

## Pull out only columns of interest for our task
cols_of_interest = chem_list[['Title','CAS Number','Cancer','Cancer - Listing Mechanism',
                          'Reproductive Toxicity','Chemical listed under Proposition 65 as causing',
                          'Developmental Toxicity - Date of Listing','Developmental Toxicity - Listing Mechanism',
                          'Female Reproductive Toxicity - Date of Listing',
                          'Female Reproductive Toxicity - Listing Mechanism',
                          'Male Reproductive Toxicity - Date of Listing',
                          'Male Reproductive Toxicity - Listing Mechanism']]

## Remove entries which are not relevant
prop_65_irrelevant = cols_of_interest.loc[(cols_of_interest['Cancer'] == "None") & 
                                          (cols_of_interest['Reproductive Toxicity'] == "None") & 
                                          (cols_of_interest['Chemical listed under Proposition 65 as causing'] == "None")]
non_prop_chems = prop_65_irrelevant['Title'].tolist()
prop65_chems = cols_of_interest.loc[~cols_of_interest['Title'].isin(non_prop_chems)].copy()
#print(prop65_chems.head(n=2))

### Chemical names to URL conversion
The property in Wikidata uses the URL stub as its ID, so we'll need to convert the chemical names to URL stubs that work with the Prop 65 website. The URLs will then be mapped to the Wikidata entries that received the property via Mix N Match. Normally the URLs could be tested, but the CA Prop 65 website has captcha protection and blocks scrapers.

Example conversion:
"OEHHA listing" --> "OEHHA url" | "Prop 65 listing" --> "Prop 65 url"
* "Amino-alpha-carboline" --> "amino-alpha-carboline" | "A-alpha-C (2-Amino-9H-pyrido[2,3-b]indole)" --> "alpha-c-2-amino-9h-pyrido23-bindole"
* "Allyl Chloride" --> "allyl-chloride" | not listed
* "alpha-Methylstyrene" --> "alpha-methylstyrene" | "?-Methyl styrene (alpha-Methylstyrene)" --> "methyl-styrene-alpha-methylstyrene"
* "MeIQx (2-Amino-3,8-dimethylimidazo[4,5-f]quinoxaline)" --> "meiqx-2-amino-38-dimethylimidazo45-fquinoxaline" | "MeIQx (2-Amino-3,8-dimethylimidazo[4,5-f]quinoxaline)" --> "meiqx-2-amino-38-dimethylimidazo45-fquinoxaline"
* "2-Mercaptobenzothiazole" --> "2-mercaptobenzothiazole" | "2?Mercaptobenzothiazole" --> "2-mercaptobenzothiazole"
* "Trp-P-1 (Tryptophan-P-1)" --> "trp-p-1-tryptophan-p-1" | "Trp-P-1 (Tryptophan-P-1)" --> "trp-p-1-tryptophan-p-1"
* "Alcoholic beverages, when associated with alcohol abuse" --> "alcoholic-beverages" | "Alcoholic beverages, when associated with alcohol abuse" --> "alcoholic-beverages"
* "Aspirin" --> "aspirin" | "Aspirin (NOTE:  It is especially  important not to use aspirin during the last three months of pregnancy,  unless specifically directed to do so by a physician because it may cause  problems in the unborn child or  complications during delivery.)" --> "aspirin"

In [12]:
## To convert the title to a url stub: lower-case it, strip out parentheses, brackets, and commas, and replace spaces with dashes
## regex=False makes every replacement literal, so "[" and "(" are not parsed as regex syntax
url_stub = prop65_chems['Title'].str.lower()
for char in ["[", "]", ",", "(", ")"]:
    url_stub = url_stub.str.replace(char, "", regex=False)
prop65_chems['url_stub'] = url_stub.str.replace(" ", "-", regex=False)
#print(prop65_chems.head())

## Check the look of the url stub
#print(prop65_chems.loc[prop65_chems['Title']=="Allyl Chloride"])
#print(prop65_chems.loc[prop65_chems['Title']=="Trp-P-1 (Tryptophan-P-1)"])
#print(prop65_chems.loc[prop65_chems['Title']=="MeIQx (2-Amino-3,8-dimethylimidazo[4,5-f]quinoxaline)"])
#print(prop65_chems.head(n=2))

#print(prop65_chems.head(n=2))
## .copy() gives an independent frame, so the in-place edits below do not raise SettingWithCopyWarning
mixnmatch_cat = prop65_chems[['url_stub','Title','CAS Number']].copy()
mixnmatch_cat.rename(columns={'url_stub':'Entry ID','Title':'Entry name'}, inplace=True)
mixnmatch_cat['Entry description'] = mixnmatch_cat['Entry name'].astype(str).str.cat(mixnmatch_cat['CAS Number'].astype(str),sep=", CAS Number: ")
mixnmatch_cat.drop('CAS Number',axis=1,inplace=True)
print(mixnmatch_cat.head(n=2))

mixnmatch_cat.to_csv('data/mixnmatch_cat.tsv',sep='\t', header=True)


              Entry ID           Entry name  \
0  abiraterone-acetate  Abiraterone acetate   
2         acetaldehyde         Acetaldehyde   

                              Entry description  
0  Abiraterone acetate, CAS Number: 154229-18-2  
2             Acetaldehyde, CAS Number: 75-07-0  


## Delisted items

After scanning the data, it appears that this table will allow us to easily identify entities which were 100% delisted, but it will require a bit more logic to identify entities that have been delisted for some conditions but not others.

Some sample comparisons between the Prop 65 list and the OEHHA list:

In the Prop 65 list, we can see that BPA was listed as causing female reproductive and developmental toxicity, but its listing as a developmental toxicant was removed on April 19, 2013.
* Bisphenol A (BPA)	female
* Bisphenol A (BPA)  Delisted April 19, 2013 developmental

In contrast, the Bisphenol A entry in the OEHHA list reads as follows:
* Reproductive Toxicity: Currently listed
* Chemical listed under Proposition 65 as causing: Female Reproductive Toxicity
* Developmental Toxicity - Date of Listing: 4/11/2013
* Developmental Toxicity - Listing Mechanism: AB-NTP-CERHR
* Female Reproductive Toxicity - Date of Listing:	5/11/2015
* Female Reproductive Toxicity - Listing Mechanism: SQE

As seen above, the delisting for developmental toxicity has to be inferred: the chemical is listed as causing only Female Reproductive Toxicity, even though it still has date and mechanism entries for Developmental Toxicity.
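The inference for a single record can be sketched as follows, using the Bisphenol A values quoted above. The column names are those of the OEHHA export; the function name `delisted_endpoints` and the dictionary layout are hypothetical. The keyword check is case-sensitive on purpose, so that "Male" does not match inside "Female".

```python
# Endpoint column prefix -> keyword expected in the "listed as causing" field
ENDPOINTS = {
    "Developmental Toxicity": "Development",
    "Female Reproductive Toxicity": "Female",
    "Male Reproductive Toxicity": "Male",
}


def delisted_endpoints(record):
    """Return endpoints that have listing history but are no longer listed as causing."""
    causing = record.get("Chemical listed under Proposition 65 as causing", "None")
    delisted = []
    for endpoint, keyword in ENDPOINTS.items():
        date = record.get(endpoint + " - Date of Listing", "None")
        mechanism = record.get(endpoint + " - Listing Mechanism", "None")
        has_history = date != "None" or mechanism != "None"
        if has_history and keyword not in causing:
            delisted.append(endpoint)
    return delisted


# Values copied from the Bisphenol A comparison above
bpa = {
    "Reproductive Toxicity": "Currently listed",
    "Chemical listed under Proposition 65 as causing": "Female Reproductive Toxicity",
    "Developmental Toxicity - Date of Listing": "4/11/2013",
    "Developmental Toxicity - Listing Mechanism": "AB-NTP-CERHR",
    "Female Reproductive Toxicity - Date of Listing": "5/11/2015",
    "Female Reproductive Toxicity - Listing Mechanism": "SQE",
}
print(delisted_endpoints(bpa))
```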

Completely delisted entries are more straightforward as seen in the case of Allyl Chloride:
This entity is not even listed in the Prop 65 list. In contrast, in the OEHHA list it appears as:
* Cancer: Formerly listed
* Cancer - Listing Mechanism: AB-US EPA 	

In [5]:
### Identify completely delisted items
delisted_df = prop65_chems.loc[((prop65_chems['Cancer']=="Formerly listed") & (prop65_chems['Reproductive Toxicity']=="None"))|
                               ((prop65_chems['Cancer']=="None") & (prop65_chems['Reproductive Toxicity']=="Formerly listed"))]
delisted_titles = delisted_df['Title'].tolist()
#print(delisted_df.head(n=5))
print(len(delisted_df))

23


## Items under consideration or considered, but not listed
We can pull these, but it's not clear how they should be included in Wikidata

In [6]:
### Identify items that were considered, but not listed
considered_df = prop65_chems.loc[((prop65_chems['Cancer']=="Considered, but not listed") & (prop65_chems['Reproductive Toxicity']=="None"))|
                               ((prop65_chems['Cancer']=="None") & (prop65_chems['Reproductive Toxicity']=="Considered, but not listed"))]
considered_titles = considered_df['Title'].tolist()
#print(considered_df.head(n=5))

### Identify items that are under consideration
considering_df = prop65_chems.loc[((prop65_chems['Cancer']=="Under consideration") & (prop65_chems['Reproductive Toxicity']=="None"))|
                               ((prop65_chems['Cancer']=="None") & (prop65_chems['Reproductive Toxicity']=="Under consideration"))]
considering_titles = considering_df['Title'].tolist()
#print(considering_df.head(n=5))
print("Considered, not listed: ",len(considered_df),"Under consideration: ", len(considering_df))


Considered, not listed:  60 Under consideration:  4


## Partially delisted items

We can filter for these by first removing items that were completely delisted, considered but not listed, or under consideration. Next, we count the number of entries under "Chemical listed under Proposition 65 as causing" and compare it with the number of toxicity types that have non-empty date or mechanism columns. If more toxicity types have non-empty columns than there are entries under "Chemical listed under Proposition 65 as causing", the item was partially delisted.

Partially delisted items also include those which are delisted under either Cancer or Reproductive Toxicity but are NOT empty for the other.
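The counting approach described above can be sketched for one record at a time. Two assumptions are made here and are not confirmed by the export: that multiple entries in "Chemical listed under Proposition 65 as causing" are comma-separated, and that an entry equal to "Cancer" (if present) should be ignored because only the reproductive endpoints have their own date and mechanism columns. The function name is hypothetical.

```python
ENDPOINT_PREFIXES = (
    "Developmental Toxicity",
    "Female Reproductive Toxicity",
    "Male Reproductive Toxicity",
)


def has_extra_listing_history(record, separator=","):
    """True when more endpoints have date/mechanism data than are currently listed."""
    causing = record.get("Chemical listed under Proposition 65 as causing", "None")
    listed = [part.strip() for part in causing.split(separator)]
    # Assumption: ignore blanks, the "None" filler value, and a possible "Cancer" entry
    listed = [part for part in listed if part and part not in ("None", "Cancer")]

    populated = 0
    for prefix in ENDPOINT_PREFIXES:
        date = record.get(prefix + " - Date of Listing", "None")
        mechanism = record.get(prefix + " - Listing Mechanism", "None")
        if date != "None" or mechanism != "None":
            populated += 1
    return populated > len(listed)


# Record shaped like the Bisphenol A entry: two populated endpoints, one listed
record = {
    "Chemical listed under Proposition 65 as causing": "Female Reproductive Toxicity",
    "Developmental Toxicity - Date of Listing": "4/11/2013",
    "Female Reproductive Toxicity - Date of Listing": "5/11/2015",
}
print(has_extra_listing_history(record))
```

The count only flags that something was delisted; identifying which toxicity type was removed still requires a per-type check.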

In [7]:
### Remove entries which were completely delisted, are under consideration, or considered, and not listed
prop_65_listed = prop65_chems.loc[~prop65_chems['Title'].isin(delisted_titles+considered_titles+considering_titles)].copy()
print("Items currently listed under prop 65: ", len(prop_65_listed))


### Identify cancer vs reproductive partially delisted items
part_delisted_df = prop_65_listed.loc[((prop_65_listed['Cancer']=="Formerly listed") & (prop_65_listed['Reproductive Toxicity']=="Currently listed"))|
                               ((prop_65_listed['Cancer']=="Currently listed") & (prop_65_listed['Reproductive Toxicity']=="Formerly listed"))]
part_delisted_titles = part_delisted_df['Title'].tolist()
#print(part_delisted_df.head(n=5))
print("Items partially delisted for cancer or reproductive toxicity: ", len(part_delisted_df))

Items currently listed under prop 65:  851
Items partially delisted for cancer or reproductive toxicity:  2


In [12]:
### Identify items that were partially delisted for one type of reproductive toxicity or another
prop_65_listed['Dev current'] = prop_65_listed['Chemical listed under Proposition 65 as causing'].str.contains("Development")
prop_65_listed['Male current'] = prop_65_listed['Chemical listed under Proposition 65 as causing'].str.contains("Male")
prop_65_listed['Female current'] = prop_65_listed['Chemical listed under Proposition 65 as causing'].str.contains("Female")

#print(prop_65_listed.head(n=2))
### These can be identified as items which are not none for date of list/listing mechanism 
### for a particular type of toxicity, but is listed as "False" for the corresponding toxicity
part_delisted_dev_df = prop_65_listed.loc[(~prop_65_listed['Title'].isin(part_delisted_titles)) &
                                            (((prop_65_listed['Developmental Toxicity - Date of Listing']!="None")|
                                             (prop_65_listed['Developmental Toxicity - Listing Mechanism']!="None"))&
                                             (prop_65_listed['Dev current']==False))]
part_delisted_fem_df = prop_65_listed.loc[(~prop_65_listed['Title'].isin(part_delisted_titles)) &
                                            (((prop_65_listed['Female Reproductive Toxicity - Date of Listing']!="None")|
                                             (prop_65_listed['Female Reproductive Toxicity - Listing Mechanism']!="None"))&
                                             (prop_65_listed['Female current']==False))]
part_delisted_male_df = prop_65_listed.loc[(~prop_65_listed['Title'].isin(part_delisted_titles)) &
                                            (((prop_65_listed['Male Reproductive Toxicity - Date of Listing']!="None")|
                                             (prop_65_listed['Male Reproductive Toxicity - Listing Mechanism']!="None"))&
                                             (prop_65_listed['Male current']==False))]   
#print(part_delisted_dev_df.head(n=2))

In [17]:
### Identify items that have not been delisted at all
part_delisted_dev_titles = part_delisted_dev_df['Title'].tolist()
part_delisted_fem_titles = part_delisted_fem_df['Title'].tolist()
part_delisted_male_titles = part_delisted_male_df['Title'].tolist()
not_delisted = prop_65_listed.loc[(~prop_65_listed['Title'].isin(part_delisted_titles))&
                                  (~prop_65_listed['Title'].isin(part_delisted_dev_titles))&
                                  (~prop_65_listed['Title'].isin(part_delisted_fem_titles))&
                                  (~prop_65_listed['Title'].isin(part_delisted_male_titles))]
print(len(not_delisted))

844


# Initial Run
The initial run should write both the listed and delisted entities. The tables should be stored so that future exports can be compared with prior runs, minimizing the number of writes needed to keep the data up-to-date. The normalization of entities will depend on assignments made via Mix N Match.

How the listings, dates, and delistings map to Wikidata statements, qualifiers, and references:

* Chemical causes cancer --> Instance (P31) of carcinogen
* Chemical causes developmental toxicity --> Instance (P31) of developmental toxicant
* Chemical causes reproductive toxicity --> Instance (P31) of reproductive toxicant
* Statement date --> retrieved (P813) : access date 
* Date listed --> start time (P580) : from date 
* Date delisted --> end time (P582) : end date 
* Delisted --> reason for deprecation (P2241) in conjunction with disqualification (Q1229261)
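The mapping above can be expressed as a plain data structure before any WDI objects are built, which makes it easy to inspect what would be written. This is a sketch only: the function names and the dictionary layout are hypothetical, the class QIDs are the ones used in the unit test cell of this notebook, and the date format assumes the `M/D/YYYY` style seen in the OEHHA export.

```python
from datetime import datetime

# QIDs as used in the unit test cell of this notebook
CLASS_QIDS = {
    "cancer": "Q187661",
    "developmental": "Q72941151",
    "female": "Q55427776",
    "male": "Q55427774",
}


def to_wd_time(date_string, fmt="%m/%d/%Y"):
    """Convert an export date into the Wikidata time string format."""
    return datetime.strptime(date_string, fmt).strftime("+%Y-%m-%dT00:00:00Z")


def plan_statement(toxicity, date_listed, date_delisted=None):
    """Describe one 'instance of' statement with its qualifiers and rank."""
    qualifiers = {"P580": to_wd_time(date_listed)}       # start time
    rank = "normal"
    if date_delisted is not None:
        qualifiers["P582"] = to_wd_time(date_delisted)   # end time
        qualifiers["P2241"] = "Q1229261"                 # reason for deprecation
        rank = "deprecated"
    return {"property": "P31", "value": CLASS_QIDS[toxicity],
            "rank": rank, "qualifiers": qualifiers}


print(plan_statement("developmental", "4/11/2013", "4/19/2013"))
```

The retrieved date (P813) is not part of this plan because `create_reference` already adds it to every reference.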

In [None]:
## Run sparql query to pull all entities with Prop 65 ID (Read Only Run)
prop65_urls = prop65_chems['url_stub'].tolist()
wdmap = []
wdmapfail = []
for prop65_id in tqdm(prop65_urls):
    try:
        sparqlQuery = "SELECT * WHERE {?topic wdt:PXXX \""+prop65_id+"\"}"
        result = wdi_core.WDItemEngine.execute_sparql_query(sparqlQuery)
        query_qid = result["results"]["bindings"][0]["topic"]["value"].replace("http://www.wikidata.org/entity/", "")
        wdmap.append({'url_stub':prop65_id,'WDID':query_qid})
    except Exception:
        ## No match (empty bindings) or a failed query: keep the ID for later inspection
        wdmapfail.append(prop65_id)

## Inspect the results for mapping or coverage issues
wdid_prop65_df = pd.DataFrame(wdmap)
print("resulting mapping table has: ",len(wdid_prop65_df)," rows.")
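Issuing one query per URL stub means several hundred requests. SPARQL's `VALUES` clause allows many IDs to be looked up in one query, which may be worth considering for the bot. The sketch below only builds the query strings and sends nothing; the function names and the batch size are hypothetical, and `PXXX` is the same placeholder property used in the cell above.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_query(url_stubs, prop="PXXX"):
    """Build one SPARQL query that looks up several url stubs at once."""
    # Escape backslashes and double quotes so each stub is a valid string literal
    literals = " ".join(
        '"{}"'.format(stub.replace("\\", "\\\\").replace('"', '\\"'))
        for stub in url_stubs
    )
    return ("SELECT ?topic ?stub WHERE { VALUES ?stub { " + literals + " } "
            "?topic wdt:" + prop + " ?stub }")


stubs = ["abiraterone-acetate", "acetaldehyde", "allyl-chloride"]
for batch in chunked(stubs, 2):
    print(build_batch_query(batch))
```

Each binding in the result then carries both `?topic` and `?stub`, so stubs missing from the results can be collected as failures after the batch returns.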

In [None]:
## Perform left merge for currently listed and partially delisted items
prop_65_mapped = prop_65_listed.merge(wdid_prop65_df, on='url_stub', how='left')

### Initial run for current listings

In [None]:
## Unit test


carcinogen_qid = 'Q187661'
devtox_qid = 'Q72941151'
femreptox_qid = 'Q55427776'
malereptox_qid = 'Q55427774'

prop_65_url = 'https://oehha.ca.gov/proposition-65/chemicals/abiraterone-acetate'
prop_65_id = 'abiraterone-acetate'
prop_65_qid = 'Q4115189' #'Q27888393'
## The reference points at the Prop 65 page for this chemical
reference = create_reference(prop_65_url)
list_prop = "P31" 
start_date = '04/08/2016'
delist_date = '4/19/2013'

list_qualifier = wdi_core.WDTime(datetime.strptime(start_date,'%m/%d/%Y').strftime("+%Y-%m-%dT00:00:00Z"), prop_nr='P580', is_qualifier=True)
## P31 takes an item as its value, so the statement is built with WDItemID
dev_statement = [wdi_core.WDItemID(value=devtox_qid, prop_nr=list_prop, 
                               qualifiers=[list_qualifier],
                               references=[copy.deepcopy(reference)])]

## append_value takes a list of property IDs whose existing values should be kept
item = wdi_core.WDItemEngine(wd_item_id=prop_65_qid, data=dev_statement, append_value=[list_prop],
                           global_ref_mode='CUSTOM', ref_handler=update_retrieved_if_new_multiple_refs)




### Initial run for completely delisted items

### Initial run for partially delisted items

### Export results for future investigations

# Scheduled Runs
The maintenance runs should parse the data in the same way as the initial run and compare the results with the stored tables from previous runs to find new entries to add and new delistings to deprecate.
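The comparison between runs can be sketched with two dictionaries that map each URL stub to its listing status. This is an illustration only: the function name `diff_runs` and the example stubs are made up, and the status strings are the ones that appear in the OEHHA export ("Currently listed", "Formerly listed").

```python
def diff_runs(previous, current):
    """Compare two {url_stub: status} tables and return (added, newly_delisted)."""
    # Entries that did not exist in the previous run
    added = sorted(stub for stub in current if stub not in previous)
    # Entries that were listed before and are now marked as formerly listed
    newly_delisted = sorted(
        stub for stub, status in current.items()
        if status == "Formerly listed" and previous.get(stub) == "Currently listed"
    )
    return added, newly_delisted


# Made-up tables standing in for the stored results of two runs
previous = {"chem-a": "Currently listed", "chem-b": "Currently listed"}
current = {"chem-a": "Currently listed", "chem-b": "Formerly listed",
           "chem-c": "Currently listed"}
added, newly_delisted = diff_runs(previous, current)
print("to add:", added)
print("to deprecate:", newly_delisted)
```

Only the entries in these two lists would need writes, which keeps the number of edits per scheduled run small.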

In [None]:
"""
## Unit test --  write a statement
disease_qid = 'Q4115189' #'Q2703116'
ghr_url = 'https://ghr.nlm.nih.gov/condition/15q11-q13-duplication-syndrome'
ghr_id = '15q11-q13-duplication-syndrome'
reference = create_reference(ghr_url)
url_prop = "P7464" 
start_date = '4/11/2013'
delist_date = '4/19/2013'


list_qualifier = wdi_core.WDTime(datetime.strptime(start_date,'%m/%d/%Y').strftime("+%Y-%m-%dT00:00:00Z"), prop_nr='P580', is_qualifier=True)
delist_qualifier = wdi_core.WDTime(datetime.strptime(delist_date,'%m/%d/%Y').strftime("+%Y-%m-%dT00:00:00Z"), prop_nr='P582', is_qualifier=True)
delist_reason = wdi_core.WDItemID('Q56478729', prop_nr='P2241', is_qualifier=True)

statement = [wdi_core.WDString(value=ghr_id, prop_nr=url_prop, rank='deprecated', 
                               qualifiers=[list_qualifier,delist_qualifier,delist_reason],
                               references=[copy.deepcopy(reference)])]
item = wdi_core.WDItemEngine(wd_item_id=disease_qid, data=statement, append_value=url_prop,
                           global_ref_mode='CUSTOM', ref_handler=update_retrieved_if_new_multiple_refs)
item.write(login)
print(ghr_id, disease_qid, ghr_url)
  


"""

In [None]:
### How deprecations are handled in Gene Bot (which uses WDI):
"""
def remove_deprecated_statements(qid, frc, releases, last_updated, props, login):

#    :param qid: qid of item
#    :param frc: a fastrun container
#    :param releases: list of releases to remove (a statement that has a reference that is stated in one of these
#            releases will be removed)
#    :param last_updated: looks like {'Q20641742': datetime.date(2017,5,6)}. a statement that has a reference that is
#            stated in Q20641742 (entrez) and was retrieved more than DAYS before 2017-5-6 will be removed
#    :param props: look at these props
#    :param login:
#    :return:
    for prop in props:
        frc.write_required([wdi_core.WDString("fake value", prop)])
    orig_statements = frc.reconstruct_statements(qid)
    releases = set(int(r[1:]) for r in releases)

    s_dep = []
    for s in orig_statements:
        if any(any(x.get_prop_nr() == 'P248' and x.get_value() in releases for x in r) for r in s.get_references()):
            setattr(s, 'remove', '')
            s_dep.append(s)
        else:
            for r in s.get_references():
                dbs = [x.get_value() for x in r if x.get_value() in last_updated]
                if dbs:
                    db = dbs[0]
                    if any(x.get_prop_nr() == 'P813' and last_updated[db] - x.get_value() > DAYS for x in r):
                        setattr(s, 'remove', '')
                        s_dep.append(s)
    if s_dep:
        print("-----")
        print(qid)
        print(len(s_dep))
        print([(x.get_prop_nr(), x.value) for x in s_dep])
        print([(x.get_references()[0]) for x in s_dep])
        wd_item = wdi_core.WDItemEngine(wd_item_id=qid, data=s_dep, fast_run=False)
        wdi_helpers.try_write(wd_item, '', '', login, edit_summary="remove deprecated statements")
"""

### How to get rank using WDI
"""
item = wdi_core.WDItemEngine(wd_item_id=qid)
new_ss = []
for s in item.statements:  # type: wdi_core.WDBaseDataType
    if s.get_rank() != "normal":
        continue
"""


### How to handle deprecations using pywikibot
"""
https://doc.wikimedia.org/pywikibot/master/_modules/pywikibot/page.html#Claim.changeRank
"""

### WDI rank handling
"""
type rank: A string of one of three allowed values: 'normal', 'deprecated', 'preferred'


    def get_rank(self):
        if self.is_qualifier or self.is_reference:
            return ''
        else:
            return self.rank

    def set_rank(self, rank):
        if self.is_qualifier or self.is_reference:
            raise ValueError('References or qualifiers do not have ranks')

        valid_ranks = ['normal', 'deprecated', 'preferred']

        if rank not in valid_ranks:
            raise ValueError('{} not a valid rank'.format(rank))

        self.rank = rank

"""

In [None]:
def run_one(taxid, genbank_id):
    # get the QID
    taxid = str(taxid)
    if taxid not in tax_qid_map:
        msg = wdi_helpers.format_msg(genbank_id, PROPS['GenBank Assembly accession'], "",
                               "organism with taxid {} not found or skipped".format(taxid))
        wdi_core.WDItemEngine.log("WARNING", msg)
        return None
    qid = tax_qid_map[taxid]
    reference = create_reference(genbank_id)
    genbank_statement = wdi_core.WDExternalID(genbank_id, PROPS['GenBank Assembly accession'], references=[reference])
    
    # create the item object, specifying the qid
    item = wdi_core.WDItemEngine(data=[genbank_statement], wd_item_id=qid, fast_run=True, 
                                 fast_run_base_filter={PROPS['GenBank Assembly accession']: ''})

    wdi_helpers.try_write(item, record_id=genbank_id, record_prop=PROPS['GenBank Assembly accession'],
                          login=login, edit_summary="update GenBank Assembly accession")
    

def run_one(taxid, genbank_id):
    # create a statement for the ncbi tax id
    ncbi_statement = wdi_core.WDExternalID(str(taxid), PROPS['NCBI Taxonomy ID'])
    # we are going to retrieve the item to be modified based on the NCBI Taxonomy ID, which should already exist on all organisms.
    try:
        item = wdi_core.WDItemEngine(data=[ncbi_statement], domain="organism", search_only=True, item_name="organism")
    except wdi_core.ManualInterventionReqException as e:
        # if there are more than one items with this ncbi tax id, this will throw an error!
        # instead, catch it and log the error
        msg = wdi_helpers.format_msg(genbank_id, PROPS['GenBank Assembly accession'], "", str(e), type(e))
        wdi_core.WDItemEngine.log("ERROR", msg)
        return
    
    if item.wd_item_id:
        # if the item exists, create the genbank statement
        reference = create_reference(genbank_id)
        genbank_statement = wdi_core.WDExternalID(genbank_id, PROPS['GenBank Assembly accession'], references=[reference])
        # create the item object, specifying the qid
        item = wdi_core.WDItemEngine(data=[genbank_statement], wd_item_id=item.wd_item_id)
        # use this helper method to perform the write. It automatically writes to a log file and captures errors
        # wdi also has an automatic backoff and retry functionality
        wdi_helpers.try_write(item, record_id=genbank_id, record_prop=PROPS['GenBank Assembly accession'],
                              login=login, edit_summary="update GenBank Assembly accession")
    else:
        # if the item doesn't exist, log it and skip
        msg = wdi_helpers.format_msg(genbank_id, PROPS['GenBank Assembly accession'], "",
                               "No organism found with taxid {}".format(taxid))
        wdi_core.WDItemEngine.log("WARNING", msg)