In [1]:
from datetime import datetime
# Date stamp (YYMMDD), reused later in the output filename
date = datetime.today().strftime('%y%m%d')
print('Last modified by Xiaoqing: ' + date)

Last modified by Xiaoqing: 211201


In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv('input.csv')
# Lowercase the criteria text so that keyword matching is case-insensitive
df['criteria'] = df['criteria'].str.lower()

# Most common scenarios

In the most common scenario, a clinical trial that mentions "alcohol" intends to exclude participants who drink too much alcohol. The same logic applies to smoking and recreational drugs. In these cases, we detect keywords and label whether a clinical trial excludes participants based on each lifestyle factor.

## Note

Mentions of lifestyle factors can appear in both inclusion and exclusion criteria. For example, a study may want to include non-smokers and exclude smokers.

## Alcohol

Many clinical trials wish to exclude participants with alcoholism. Typical phrases include: alcoholism, alcohol use, alcohol dependence, alcohol abuse. All of them contain the substring "alcohol".

Here are some real examples from clinical trials:
* alcoholism
* Has alcohol and/or drug abuse.
* Drug or alcohol dependence
* Consuming more than 14 alcoholic beverages per week
* Substance or alcohol dependence according to the DSM-IV criteria at randomization (except complete recovered, and caffeine and nicotine dependence)
* Mental illness or history of drug or alcohol abuse that, in the opinion of the investigator, would interfere with the participant's ability to comply with study requirements.
* Significant history of alcoholism or drug abuse.
* Hospitalization anytime since 1990 for Alcohol or Drug Dependence, Depression, or PTSD
* drug-alcoholics addiction ;
* Addictions to alcohol or drugs.
* Chronic alcohol use not diagnosed in criterion 6. i. Subjects unwilling to limit their alcohol intake to 2 standard drinks per day will be excluded. 

If a clinical trial's eligibility criteria contain the keyword "alcohol", we label it alcohol = 1.

In [4]:
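# Default every trial to 0, then flag the ones that mention alcohol
df['alcohol'] = 0

for index, row in df.iterrows():
    # Substring test on the lowercased criteria text
    if 'alcohol' in row['criteria']:
        df.loc[index, 'alcohol'] = 1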
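
The same labelling can also be written without an explicit loop. The sketch below is one possible alternative, not the notebook's own method: it uses the pandas `Series.str.contains` method on a small toy DataFrame (the first two rows are examples quoted in this notebook, the third is an assumed missing value).

```python
import pandas as pd

# Toy stand-in for the notebook's df; criteria text is already lowercased.
df = pd.DataFrame({
    'criteria': [
        'significant history of alcoholism or drug abuse.',
        'patients who are actively smoking.',
        None,  # assumption: a trial with a missing criteria field
    ]
})

# str.contains tests every row at once. regex=False requests a plain
# substring match, and na=False counts missing text as "no match".
df['alcohol'] = (
    df['criteria'].str.contains('alcohol', regex=False, na=False).astype(int)
)

print(df['alcohol'].tolist())  # [1, 0, 0]
```

Unlike the loop, this version does not raise an error when a criteria value is missing, because `na=False` maps it to 0.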



## Drug

Here we detect whether the criteria contain BOTH a keyword from list a and a keyword from list b.

In [5]:
df['drug'] = 0

a = ['drug', 'substance', 'marijuana']
b = ['dependence', 'addiction', 'abuse', 'use']

In [6]:
for index, row in df.iterrows():
    if any(x in row['criteria'] for x in a) and any(x in row['criteria'] for x in b):
        df.loc[index,'drug'] = 1
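
One caveat of plain substring tests is that short keywords can match inside unrelated words: 'use' is a substring of 'because' and 'nausea'. The sketch below shows one way to tighten the rule with regular-expression word boundaries. It assumes whole-word matching (with an optional plural "s") is acceptable, which is a different rule from the one the loop above applies; `build_pattern` and `mentions_drug` are hypothetical helper names, and the second test sentence is made up for illustration.

```python
import re

a = ['drug', 'substance', 'marijuana']
b = ['dependence', 'addiction', 'abuse', 'use']

def build_pattern(words):
    """Compile a regex matching any of `words` as a whole word (optional plural 's')."""
    alternatives = '|'.join(re.escape(w) for w in words)
    return re.compile(r'\b(?:' + alternatives + r')s?\b')

pattern_a = build_pattern(a)
pattern_b = build_pattern(b)

def mentions_drug(text):
    """Return 1 if text has a whole-word hit from list a AND from list b, else 0."""
    return int(bool(pattern_a.search(text)) and bool(pattern_b.search(text)))

print(mentions_drug('drug or alcohol dependence'))  # 1
# Substring matching would flag this sentence ('drug' plus 'use' inside 'because').
print(mentions_drug('excluded because the study drug may cause nausea'))  # 0
```

The trade-off is that inflected forms such as "used" or "using" no longer match, so the keyword lists may need to be extended if those forms matter.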


## Smoking

Here we label a trial if the criteria contain any keyword from list c, or BOTH a keyword from list a and a keyword from list b.

In [18]:
df['smoke'] = 0

a = ['smoked', 'tobacco', 'nicotine']
b = ['dependence', 'addiction', 'abuse', 'use', 'users']
c = ['smoker', 'smoking']

In [19]:
for index, row in df.iterrows():
    if any(x in row['criteria'] for x in c):
        df.loc[index,'smoke'] = 1
    elif any(x in row['criteria'] for x in a) and any(x in row['criteria'] for x in b):
        df.loc[index,'smoke'] = 1
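
To spot-check the rule on single sentences, the logic of the loop above can be restated as a small function. This is only a sketch: `flag_smoking` is a hypothetical name, the first test sentence comes from the data shown in this notebook, and the other two are made up for illustration.

```python
a = ['smoked', 'tobacco', 'nicotine']
b = ['dependence', 'addiction', 'abuse', 'use', 'users']
c = ['smoker', 'smoking']

def flag_smoking(text):
    """Return 1 if text matches list c, or both list a and list b; else 0."""
    has_c = any(x in text for x in c)
    has_a_and_b = any(x in text for x in a) and any(x in text for x in b)
    return int(has_c or has_a_and_b)

print(flag_smoking('patients who are actively smoking.'))    # 1, via list c
print(flag_smoking('tobacco use in the past 6 months'))      # 1, via lists a and b
print(flag_smoking('has smoked fewer than 100 cigarettes'))  # 0, list a only
```

Assuming no criteria value is missing, `df['criteria'].apply(flag_smoking)` should produce the same labels as the loop.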

In [20]:
df.tail()

Unnamed: 0,id,criteria,alcohol,drug,smoke
28,29,"smokers, tobacco/snuff/nicotine users, recreat...",0,1,1
29,30,presently a smoker or ex-smoker with history o...,0,0,1
30,31,a history of abuse of psychotropic substances ...,0,1,1
31,32,patients who are actively smoking.,0,0,1
32,33,is an active smoker or stopped smoking in the ...,0,0,1


In [21]:
df.to_csv('output_' + date + '.csv', index=False)

# The less common scenarios

In the less common scenarios, a clinical trial may, for example, study cigarette cessation. In this case, the trial actually wants to recruit participants who smoke.

We need to separate clinical trials into two groups: substance cessation studies and everything else.

If a clinical trial is a substance cessation study, then we know that it DOES want participants who drink, use drugs, or smoke. For each clinical trial in our database, we will verify whether it is a substance-related trial.
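
The notebook does not yet define how a cessation study is recognised. As a rough sketch only, one option is another keyword test. Everything specific below is an assumption made for illustration: the keyword list, the `title` column, the `cessation_study` label, and the two made-up trial titles.

```python
import re
import pandas as pd

# Assumption: illustrative keywords; the notebook does not specify any.
cessation_keywords = ['cessation', 'quit smoking', 'quitting']

# Assumption: each trial has a 'title' column. These titles are made up.
trials = pd.DataFrame({
    'title': [
        'A Smoking Cessation Program for Adults',
        'Effect of Exercise on Blood Pressure',
    ]
})

# Build one regex that matches any keyword, escaping special characters.
pattern = '|'.join(re.escape(k) for k in cessation_keywords)

# Lowercase the titles, then flag rows containing any cessation keyword.
trials['cessation_study'] = (
    trials['title'].str.lower().str.contains(pattern, regex=True, na=False).astype(int)
)

print(trials['cessation_study'].tolist())  # [1, 0]
```

Trials flagged this way could then be handled separately, since a lifestyle keyword in their criteria likely signals inclusion rather than exclusion.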