Script_for_Chemical-Mixtures-in-Household-Environments-In-Silico-Predictions-and-In-Vitro-Testing-of-Potential-Joint-Toxicities-in-Human-Liver-Cells
The scripts in this repository were generated to support the analysis and associated figures contained within the manuscript:
- Carberry CK, Turla T, Koval LE, Hartwell H, Fry RC, Rager JE. Chemical Mixtures in Household Environments: In Silico Predictions and In Vitro Testing of Potential Joint Action on PPARγ in Human Liver Cells. Toxics. 2022 Apr 19;10(5):199. doi: https://doi.org/10.3390/toxics10050199. PMID: 35622613. PMCID: PMC9146550.
make_presence_absence.py - Reads in the chemicals and associated keyword sets from the CPDat Chemical List Presence dataset factotum_listPresence_092320_updated_chemnames_020621.csv, the mapping of CPDat keywords to exposure source categories keyword_esc.csv, and the ToxCast chemicals of interest from the PositiveChemicals sheet of ToxCast_PPARg_DataPull_050321.xlsx, all of which are located in the input folder. Chemicals in both the CPDat and ToxCast datasets are identified by DTXSID. The script notes which chemicals in the CPDat dataset are also present in the ToxCast dataset. Exposure source categories are mapped to all chemicals in the CPDat dataset based on their associated keywords. Duplicate chemical/exposure source category pairs are dropped, and a filter is applied to keep only chemicals that are linked to at least two unique exposure source categories. An n x m dataframe is produced, where n is the number of chemicals that pass the >=2 exposure source category filter and m is the number of exposure source categories. A 1 in the dataframe indicates an association between the chemical and the exposure source category, and a 0 indicates no association. The dataframe is written to the file presence_absence_binary_df.csv, which serves as one of the inputs to toxcast_cluster.R. Additionally, a reference file name_dtxsid_ref.csv is produced, which maps the true_chemname to the DTXSID for each chemical according to the original CPDat dataset. Both files are written to the output folder.
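The mapping, de-duplication, and filtering steps described for make_presence_absence.py can be sketched as follows. This is a minimal illustration in plain Python, not the script itself, which works with dataframes and may differ in detail. The function name `build_presence_absence`, the placeholder identifiers (`DTXSID_A`, etc.), and the keyword and category labels are all hypothetical. The sketch also assumes that the columns cover every category in the keyword mapping.

```python
from collections import defaultdict


def build_presence_absence(chem_keywords, keyword_to_category, min_categories=2):
    """Build a binary chemical-by-category matrix.

    chem_keywords: iterable of (chemical_id, keyword) pairs (hypothetical input shape).
    keyword_to_category: dict mapping each keyword to an exposure source category.
    Returns (chemicals, categories, matrix) with one matrix row per kept chemical.
    """
    chem_cats = defaultdict(set)
    for chem_id, keyword in chem_keywords:
        category = keyword_to_category.get(keyword)
        if category is not None:
            # Using a set drops duplicate chemical/category pairs.
            chem_cats[chem_id].add(category)

    # Keep only chemicals linked to at least `min_categories` unique categories.
    kept = {c: cats for c, cats in chem_cats.items() if len(cats) >= min_categories}

    chemicals = sorted(kept)
    # Assumption: columns are all categories present in the mapping.
    categories = sorted(set(keyword_to_category.values()))
    matrix = [[1 if cat in kept[c] else 0 for cat in categories] for c in chemicals]
    return chemicals, categories, matrix


# Hypothetical toy data, for illustration only.
pairs = [
    ("DTXSID_A", "detergent"),
    ("DTXSID_A", "toy"),
    ("DTXSID_A", "detergent"),  # duplicate pair, dropped
    ("DTXSID_B", "paint"),      # only one category, filtered out
    ("DTXSID_C", "paint"),
    ("DTXSID_C", "soap"),
]
mapping = {
    "detergent": "cleaning",
    "soap": "cleaning",
    "toy": "childrens",
    "paint": "home_maintenance",
}

chemicals, categories, matrix = build_presence_absence(pairs, mapping)
for chem, row in zip(chemicals, matrix):
    print(chem, row)
```

In this toy input, `DTXSID_B` maps to a single category and is therefore excluded, mirroring the >=2 exposure source category filter.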
toxcast_cluster.R - Reads in the binary dataframe presence_absence_binary_df.csv and the name/DTXSID mapping name_dtxsid_ref.csv from the output folder. A dataframe is produced containing only the 148 chemicals from the CPDat dataset that were also identified as relevant ToxCast chemicals. A distance matrix of the presence/absence data for these 148 chemicals is generated by computing the Jaccard distance for each pair of chemicals. Divisive hierarchical clustering is performed on the distance matrix to cluster the chemicals. A Jaccard distance matrix is then generated for the exposure source categories, and hierarchical clustering is used to cluster the exposure source categories. The exposure source categories were clustered to better organize the final heatmap and thereby aid in the interpretation of results. The clustered, organized presence/absence dataframe is written to Toxcast_chems_cluster_assignments.csv in the output folder. A heatmap of this dataframe is then generated and saved as Toxcast_chems_clustered_heatamp.png, which is stored in the figures folder. Additionally, a heatmap of only the chemicals in cluster 5, the cluster identified as being of greatest interest, is generated and saved as cluster_5_heatmap.png, also stored in the figures folder.
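To illustrate the distance step described for toxcast_cluster.R, the sketch below computes pairwise Jaccard distances on binary presence/absence rows, first between chemicals and then between exposure source categories (by transposing the matrix). It is written in Python purely for illustration. toxcast_cluster.R performs this step in R, and the divisive hierarchical clustering is not reproduced here. The function names and toy matrix are hypothetical. Treating two all-zero vectors as distance 0 is an assumption of this sketch, not something stated by the source.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two binary vectors: 1 - |A and B| / |A or B|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 0.0
    return 1.0 - both / either


def distance_matrix(rows):
    """Symmetric matrix of pairwise Jaccard distances between rows."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = jaccard_distance(rows[i], rows[j])
            dist[i][j] = dist[j][i] = d
    return dist


# Hypothetical presence/absence matrix: rows are chemicals, columns are categories.
rows = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
]

# Distances between chemicals (rows).
chem_dist = distance_matrix(rows)

# Distances between exposure source categories (columns), via transpose.
cols = [list(col) for col in zip(*rows)]
cat_dist = distance_matrix(cols)

print(chem_dist)
print(cat_dist)
```

Chemicals with identical category profiles (the first and third rows here) have a distance of 0, so a clustering step run on this matrix would be expected to group them together.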