# Water environmental package profiling
This notebook profiles biosamples indicated as being part of a water-related package.  
Profiling, in this sense, means a descriptive analysis of the field types, field values, and number of missing values.
Steps:
1. Load and subset/slice data
2. Normalize water env_package labels
3. Describe water env_packages. This includes:
  - all water-related packages
  - env_packages labeled as 'water'
  - env_packages labeled as 'waste water'
  - env_packages labeled as 'sea water'
  - env_packages labeled as 'Deep-sea water'
  - env_packages labeled as 'fresh water'
  - env_packages labeled as 'sterile water'
  - env_packages labeled as 'Coastal water body (ENVO:02000049)'

In [1]:
import pandas as pds
import numpy as np
import csv

# 1. Load and subset/slice data

### Load non-human dataset created by notebook `build-non-human-samples.ipynb`

In [2]:
df = pds.read_csv('../../target/non-human-samples.tsv.gz', sep='\t', dtype=str, quoting=csv.QUOTE_NONE)
len(df)

45951
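
The notebook does not say why `dtype=str` and `quoting=csv.QUOTE_NONE` are used when loading. One plausible reading is that they keep identifier-like values as text and treat stray quote characters as ordinary data. A minimal sketch on a made-up in-memory TSV (`toyTsv` and its values are hypothetical, not taken from the dataset):

```python
import csv
import io

import pandas as pds

# hypothetical two-line TSV: a numeric-looking id with a leading zero and an unbalanced quote
toyTsv = 'id\ttaxonomy_id\tnote\nS1\t0408172\tsaid "hello\n'

loaded = pds.read_csv(io.StringIO(toyTsv), sep='\t', dtype=str, quoting=csv.QUOTE_NONE)

# dtype=str keeps the leading zero; QUOTE_NONE keeps the quote character as literal text
print(loaded.loc[0, 'taxonomy_id'])
print(loaded.loc[0, 'note'])
```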

### get list of env packages that have the string 'water' in them

In [3]:
waterPackages = [x for x in df.env_package.unique() if 'water' in x.lower()]

In [4]:
waterPackages

['water',
 'MIMARKS_water',
 'MIGS/MIMS/MIMARKS.wastewater',
 'MIGS/MIMS/MIMARKS.water',
 'seawater',
 'wastewater/sludge',
 'wastewater|sludge',
 'wastewater',
 'fresh water',
 'freshwater sediment',
 'sea water',
 'sterile water',
 'Coastal water body (ENVO:02000049)',
 'sea water, [ENVO:00002149]',
 'Water',
 'wastewater sludge',
 'Sea water',
 'Deep-sea water']
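
The list comprehension above calls `x.lower()` on every unique value, which would raise an `AttributeError` if `env_package` ever contained a missing value (NaN is a float). A minimal sketch of a NaN-tolerant variant, using a small made-up dataframe (`toyDf` and its values are illustrative assumptions):

```python
import numpy as np
import pandas as pds

# hypothetical toy data standing in for the real biosample table
toyDf = pds.DataFrame({'env_package': ['water', 'Sea water', 'soil', np.nan, 'wastewater/sludge']})

# drop NaN before taking unique values so that string methods are safe to call
uniquePackages = toyDf.env_package.dropna().unique()
toyWaterPackages = [x for x in uniquePackages if 'water' in x.lower()]
print(toyWaterPackages)
```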

### filter out packages containing the strings 'sludge' and 'sediment'

In [5]:
waterPackages = [x for x in waterPackages if 'sludge' not in x.lower()]
waterPackages = [x for x in waterPackages if 'sediment' not in x.lower()]

In [6]:
waterPackages

['water',
 'MIMARKS_water',
 'MIGS/MIMS/MIMARKS.wastewater',
 'MIGS/MIMS/MIMARKS.water',
 'seawater',
 'wastewater',
 'fresh water',
 'sea water',
 'sterile water',
 'Coastal water body (ENVO:02000049)',
 'sea water, [ENVO:00002149]',
 'Water',
 'Sea water',
 'Deep-sea water']
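
The two filtering passes above could also be written as a single pass over a tuple of excluded terms, which may be easier to extend if more terms need to be excluded later. A small sketch on a hand-written list (`candidates` and `excludedTerms` are hypothetical names):

```python
# hypothetical subset of labels, chosen to include both excluded terms
candidates = ['water', 'wastewater/sludge', 'freshwater sediment', 'Sea water']
excludedTerms = ('sludge', 'sediment')

# keep a label only if none of the excluded terms appears in it (case-insensitive)
kept = [x for x in candidates if not any(term in x.lower() for term in excludedTerms)]
print(kept)
```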

### create slice of data based on water package list

In [7]:
sliceDf = df[df.env_package.isin(waterPackages)]
len(sliceDf)

18615

# 2. Normalize water env_package labels
Some of the env_packages seem to refer to the same water package, but with syntax variations (e.g. 'Water', 'MIMARKS_water').  
To account for this, create column 'norm_env_package' to hold normalized values and normalize data as follows:
1. Map 'Sea water', 'seawater', 'sea water', 'sea water, [ENVO:00002149]' to 'sea water'.
2. Map 'wastewater', 'MIGS/MIMS/MIMARKS.wastewater' to 'waste water'.
3. Map 'water', 'Water', 'MIMARKS_water', 'MIGS/MIMS/MIMARKS.water' to 'water'.
4. Leave 'Deep-sea water', 'fresh water', 'sterile water', and 'Coastal water body (ENVO:02000049)' unchanged.

In [8]:
def mapWaterPackage(val):
    if val in ['Sea water', 'seawater', 'sea water', 'sea water, [ENVO:00002149]']:
        return 'sea water'
    elif val in ['wastewater', 'MIGS/MIMS/MIMARKS.wastewater']:
        return 'waste water'
    elif val in ['water', 'Water', 'MIMARKS_water', 'MIGS/MIMS/MIMARKS.water']:
        return 'water'
    else:
        return val
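
The same rules could also be expressed as a lookup table rather than an `if`/`elif` chain. The sketch below is an alternative, not the notebook's method; `normalizedLabels` and `toyPackages` are hypothetical names.

```python
import pandas as pds

# keys are raw labels, values are normalized labels (rules 1-3)
normalizedLabels = {
    'Sea water': 'sea water',
    'seawater': 'sea water',
    'sea water': 'sea water',
    'sea water, [ENVO:00002149]': 'sea water',
    'wastewater': 'waste water',
    'MIGS/MIMS/MIMARKS.wastewater': 'waste water',
    'water': 'water',
    'Water': 'water',
    'MIMARKS_water': 'water',
    'MIGS/MIMS/MIMARKS.water': 'water',
}

toyPackages = pds.Series(['Water', 'seawater', 'Deep-sea water', 'MIGS/MIMS/MIMARKS.wastewater'])

# labels without an entry fall through unchanged (rule 4)
normalized = toyPackages.map(lambda x: normalizedLabels.get(x, x))
print(normalized.tolist())
```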

In [9]:
sliceDf['norm_env_package'] = sliceDf['env_package'].map(lambda x: mapWaterPackage(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
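
The warning above is raised because `sliceDf` was produced by boolean indexing on `df`, so pandas cannot tell whether the new column is meant for the slice or for the original dataframe. One common way to avoid it is to take an explicit copy when creating the slice. A minimal sketch on made-up data (`toyDf` and `toySlice` are hypothetical):

```python
import pandas as pds

toyDf = pds.DataFrame({'env_package': ['water', 'soil', 'Water']})

# .copy() makes the slice an independent dataframe,
# so adding a column to it is unambiguous
toySlice = toyDf[toyDf.env_package.isin(['water', 'Water'])].copy()
toySlice['norm_env_package'] = toySlice['env_package'].str.lower()

print(toySlice)
```

The original dataframe is left untouched; only the copy gains the new column.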


### verify all values have been normalized (i.e., counts match) and get counts of packages

In [10]:
sliceDf.norm_env_package.count() == len(sliceDf)

True

In [11]:
sliceDf.groupby('norm_env_package')['norm_env_package'].count().sort_values(ascending=False)

norm_env_package
water                                 18139
waste water                             399
sterile water                            35
fresh water                              27
sea water                                13
Deep-sea water                            1
Coastal water body (ENVO:02000049)        1
Name: norm_env_package, dtype: int64
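
For reference, `Series.value_counts()` produces the same kind of per-label tally, already sorted in descending order. A short sketch on made-up labels (`toyDf` is hypothetical):

```python
import pandas as pds

toyDf = pds.DataFrame({
    'norm_env_package': ['water', 'water', 'sea water', 'water', 'waste water', 'sea water']
})

# counts per label, largest first; NaN values are excluded by default
counts = toyDf.norm_env_package.value_counts()
print(counts)
```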

# 3. Describe all water env_packages

### define some helper functions

In [12]:
def replaceMissingValues(df):
    # many of the missing values have values beginning with 'Missing'; e.g.: Missing: Not provided
    # replace these with NaN
    for idx in df.index:                 # iterate over indexes
        row = df.loc[idx,:]              # get the dataframe row; note: this is of type Pandas Series
        for field in row.index:          # iterate over each field (i.e., series index) and replace 'Missing' with NaN
            val = row[field]
            if not pds.isnull(val) and val.lower()[:7] == 'missing':
                df.loc[idx, field] = np.nan   # assign through the dataframe; assigning to `row` may only change a copy
    return df

In [13]:
def getReducedDatame(df):
    # many of the missing values have values beginning with 'Missing'; e.g.: Missing: Not provided
    # replace these with NaN so that the describe function can count properly
    reducedDf = replaceMissingValues(df)
    
    # remove columns from the reduced dataframe that don't have data
    describeDf = df.describe()                   # describe the dataframe
    valueCount = describeDf.loc['count']         # get count row / index
    zeroCount = valueCount[valueCount == 0]      # subset valueCount to those with counts of 0
    zeroCountCols = zeroCount.index              # find the columns with zero counts
    reducedDf = df.drop(zeroCountCols, axis=1)   # drop columns that have counts of 0 (i.e., no data)
    
    return reducedDf
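
The two helpers above work cell by cell and then inspect `describe()` counts. As a rough alternative sketch (not the notebook's method), the same two steps can be written with vectorized string methods and `dropna`. The toy dataframe and the names `isMissing`, `cleaned`, and `reduced` are assumptions, and the sketch presumes every column holds strings or NaN, as is the case when data is loaded with `dtype=str`.

```python
import numpy as np
import pandas as pds

toyDf = pds.DataFrame({
    'id': ['a', 'b', 'c'],
    'depth': ['5', 'Missing: Not provided', np.nan],
    'elev': ['missing', 'Missing: Not collected', np.nan],
})

# flag string values that begin with 'missing' (case-insensitive); NaN counts as not matching
isMissing = toyDf.apply(lambda col: col.str.lower().str.startswith('missing', na=False))

cleaned = toyDf.mask(isMissing)               # flagged cells become NaN
reduced = cleaned.dropna(axis=1, how='all')   # drop columns with no remaining data

print(reduced)
```

Here `elev` is dropped because every value is either NaN or a 'Missing' placeholder, while `depth` is kept because one real value remains.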

In [14]:
def getDescibeDf(df):
    describeDf = df.describe().transpose()                                                  # describe df; note: the transpose
    describeDf['pct_non_null'] =\
        describeDf['count'].map(lambda x: 0.00 if x == 0 else round((x / len(df)) * 100, 2)) # add column to show the percent of non-null data
    return describeDf
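
The `pct_non_null` column is the share of rows with a value, expressed as a percentage. The same figure can be cross-checked directly from a dataframe with `notna().mean()`. A small sketch on made-up data (`toyDf` and `pctNonNull` are hypothetical names):

```python
import numpy as np
import pandas as pds

toyDf = pds.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'depth': ['5', np.nan, '10', np.nan],
})

# notna() gives True/False per cell; the column mean of booleans is the non-null fraction
pctNonNull = (toyDf.notna().mean() * 100).round(2)
print(pctNonNull)
```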

## Describe all water samples

### get reduced dataframe (i.e., columns with no data removed) and describe

In [15]:
reducedDf = getReducedDatame(sliceDf) 
reducedDescribe = getDescibeDf(reducedDf)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


### save copy of water-related packages

In [48]:
reducedDf.to_csv('../../target/water-related-samples.tsv.gz', sep='\t', compression='gzip', index=False)
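
A quick way to check a saved file like this is to read it back with the same options used for loading and compare shapes. The sketch below does this for a made-up dataframe in a temporary directory; `toyDf` and the file name `toy-samples.tsv.gz` are hypothetical.

```python
import csv
import os
import tempfile

import pandas as pds

toyDf = pds.DataFrame({'id': ['a', 'b'], 'depth': ['5', None]})

with tempfile.TemporaryDirectory() as tmpDir:
    path = os.path.join(tmpDir, 'toy-samples.tsv.gz')   # hypothetical file name
    toyDf.to_csv(path, sep='\t', compression='gzip', index=False)
    # read back with the same options used when loading the original dataset
    roundTrip = pds.read_csv(path, sep='\t', dtype=str, quoting=csv.QUOTE_NONE)

print(roundTrip.shape == toyDf.shape)
```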

In [49]:
reducedDf.shape

(18615, 97)

In [17]:
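pds.set_option('display.max_rows', None)  # use the full option name; the short form 'max_rows' can be ambiguous in newer pandas versions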
reducedDescribe.sort_values(by='pct_non_null', ascending=False)

| | count | unique | top | freq | pct_non_null |
| --- | --- | --- | --- | --- | --- |
| norm_env_package | 18615 | 7 | water | 18139 | 100.0 |
| package | 18615 | 19 | Generic.1.0 | 15933 | 100.0 |
| taxonomy_name | 18615 | 105 | marine metagenome | 6530 | 100.0 |
| taxonomy_id | 18615 | 107 | 408172 | 6530 | 100.0 |
| id | 18615 | 18615 | BIOSAMPLE:SAMEA4560813 | 1 | 100.0 |
| env_package | 18615 | 14 | water | 16367 | 100.0 |
| env_local_scale | 18525 | 539 | freshwater lake | 2887 | 99.52 |
| env_broad_scale | 18517 | 314 | small lake biome | 3264 | 99.47 |
| env_medium | 17753 | 232 | fresh water | 5839 | 95.37 |
| depth | 15747 | 868 | 5 | 2167 | 84.59 |


## Describe env_packages labeled as 'water'

In [18]:
waterDf = getReducedDatame(reducedDf[reducedDf.norm_env_package == 'water'].copy()) # note: a copy of the dataframe is being used
waterDescribe = getDescibeDf(waterDf)

In [19]:
waterDf.shape

(18139, 96)

In [20]:
waterDescribe.sort_values(by='pct_non_null', ascending=False)

| | count | unique | top | freq | pct_non_null |
| --- | --- | --- | --- | --- | --- |
| norm_env_package | 18139 | 1 | water | 18139 | 100.0 |
| env_package | 18139 | 4 | water | 16367 | 100.0 |
| package | 18139 | 13 | Generic.1.0 | 15856 | 100.0 |
| taxonomy_name | 18139 | 85 | marine metagenome | 6530 | 100.0 |
| taxonomy_id | 18139 | 87 | 408172 | 6530 | 100.0 |
| id | 18139 | 18139 | BIOSAMPLE:SAMEA4560813 | 1 | 100.0 |
| env_broad_scale | 18050 | 257 | small lake biome | 3264 | 99.51 |
| env_local_scale | 18039 | 440 | freshwater lake | 2887 | 99.45 |
| env_medium | 17304 | 198 | fresh water | 5829 | 95.4 |
| depth | 15564 | 858 | 5 | 2167 | 85.8 |


## Describe env_packages labeled as 'waste water'

In [21]:
wasteWaterDf = getReducedDatame(reducedDf[reducedDf.norm_env_package == 'waste water'].copy()) # note: a copy of the dataframe is being used
wasteWaterDescribe = getDescibeDf(wasteWaterDf)

In [22]:
wasteWaterDf.shape

(399, 38)

In [23]:
wasteWaterDescribe.sort_values(by='pct_non_null', ascending=False)

| | count | unique | top | freq | pct_non_null |
| --- | --- | --- | --- | --- | --- |
| norm_env_package | 399 | 1 | waste water | 399 | 100.0 |
| taxonomy_id | 399 | 20 | 527639 | 201 | 100.0 |
| env_medium | 399 | 35 | water | 119 | 100.0 |
| env_broad_scale | 399 | 56 | activated sludge | 66 | 100.0 |
| id | 399 | 399 | BIOSAMPLE:SAMN02261241 | 1 | 100.0 |
| env_local_scale | 399 | 96 | sludge | 63 | 100.0 |
| taxonomy_name | 399 | 20 | wastewater metagenome | 201 | 100.0 |
| env_package | 399 | 2 | MIGS/MIMS/MIMARKS.wastewater | 385 | 100.0 |
| package | 399 | 7 | MIMS.me.wastewater.5.0 | 363 | 100.0 |
| collection_date | 396 | 185 | Apr-2013 | 36 | 99.25 |


## Describe env_packages labeled as 'sea water'

In [30]:
seaWaterDf = getReducedDatame(reducedDf[reducedDf.norm_env_package == 'sea water'].copy()) # note: a copy of the dataframe is being used
seaWaterDescribe = getDescibeDf(seaWaterDf)

In [31]:
seaWaterDf.shape

(13, 26)

In [32]:
seaWaterDescribe.sort_values(by='pct_non_null', ascending=False)

| | count | unique | top | freq | pct_non_null |
| --- | --- | --- | --- | --- | --- |
| norm_env_package | 13 | 1 | sea water | 13 | 100.0 |
| package | 13 | 3 | Generic.1.0 | 11 | 100.0 |
| env_local_scale | 13 | 4 | intertidal zone | 9 | 100.0 |
| taxonomy_id | 13 | 4 | 1262161 | 9 | 100.0 |
| taxonomy_name | 13 | 4 | Pyropia haitanensis | 9 | 100.0 |
| id | 13 | 13 | BIOSAMPLE:SAMN01818488 | 1 | 100.0 |
| geo_loc_name | 13 | 4 | China: Fujian Province | 9 | 100.0 |
| env_package | 13 | 4 | seawater | 9 | 100.0 |
| isol_growth_condt | 11 | 9 | 21C, 50 mol photons m2 s1 (12L:13D) | 2 | 84.62 |
| collection_date | 11 | 3 | 19-Mar-2012 | 9 | 84.62 |


## Describe env_packages labeled as 'Deep-sea water'

In [34]:
deepSeaWaterDf = getReducedDatame(reducedDf[reducedDf.norm_env_package == 'Deep-sea water'].copy()) # note: a copy of the dataframe is being used
deepSeaWaterDescribe = getDescibeDf(deepSeaWaterDf)

In [35]:
deepSeaWaterDf.shape

(1, 21)

In [36]:
deepSeaWaterDescribe

| | count | unique | top | freq | pct_non_null |
| --- | --- | --- | --- | --- | --- |
| num_replicons | 1 | 1 | 1 | 1 | 100.0 |
| env_package | 1 | 1 | Deep-sea water | 1 | 100.0 |
| geo_loc_name | 1 | 1 | Indian Ocean | 1 | 100.0 |
| biotic_relationship | 1 | 1 | free living | 1 | 100.0 |
| project_name | 1 | 1 | Complete genome sequence of Alcanivorax sp. IO_7 | 1 | 100.0 |
| salinity | 1 | 1 | 0-15% (optimum at 3.5%) | 1 | 100.0 |
| investigation_type | 1 | 1 | bacteria_archaea | 1 | 100.0 |
| collection_date | 1 | 1 | 2017 | 1 | 100.0 |
| temp | 1 | 1 | 5-40 °C (optimum at 28 °C) | 1 | 100.0 |
| env_broad_scale | 1 | 1 | Sea [ENVO:00000016] | 1 | 100.0 |


## Describe env_packages labeled as 'fresh water'

In [37]:
freshWaterDf = getReducedDatame(reducedDf[reducedDf.norm_env_package == 'fresh water'].copy()) # note: a copy of the dataframe is being used
freshWaterDescribe = getDescibeDf(freshWaterDf)

In [38]:
freshWaterDf.shape

(27, 13)

In [39]:
freshWaterDescribe

| | count | unique | top | freq | pct_non_null |
| --- | --- | --- | --- | --- | --- |
| env_package | 27 | 1 | fresh water | 27 | 100.0 |
| host_taxid | 27 | 4 | 4530 | 10 | 100.0 |
| geo_loc_name | 27 | 2 | USA | 17 | 100.0 |
| elev | 27 | 2 | 0 | 17 | 100.0 |
| env_broad_scale | 27 | 2 | estuarine biome | 17 | 100.0 |
| env_local_scale | 27 | 2 | large river biome | 17 | 100.0 |
| env_medium | 10 | 1 | fresh water | 10 | 37.04 |
| depth | 27 | 1 | 0 | 27 | 100.0 |
| id | 27 | 27 | BIOSAMPLE:SAMEA4036406 | 1 | 100.0 |
| taxonomy_id | 27 | 3 | 449393 | 16 | 100.0 |


## Describe env_packages labeled as 'sterile water'

In [40]:
sterileWaterDf = getReducedDatame(reducedDf[reducedDf.norm_env_package == 'sterile water'].copy()) # note: a copy of the dataframe is being used
sterileWaterDescribe = getDescibeDf(sterileWaterDf)

In [41]:
sterileWaterDf.shape

(35, 14)

In [42]:
sterileWaterDescribe

| | count | unique | top | freq | pct_non_null |
| --- | --- | --- | --- | --- | --- |
| env_package | 35 | 1 | sterile water | 35 | 100.0 |
| host_taxid | 35 | 1 | 256318 | 35 | 100.0 |
| geo_loc_name | 35 | 1 | Costa Rica | 35 | 100.0 |
| elev | 35 | 1 | 1656 | 35 | 100.0 |
| collection_date | 35 | 1 | 2050-01-01 00:00:00 | 35 | 100.0 |
| env_broad_scale | 35 | 1 | urban biome | 35 | 100.0 |
| env_local_scale | 35 | 1 | research facility | 35 | 100.0 |
| env_medium | 35 | 1 | sterile water | 35 | 100.0 |
| depth | 35 | 1 | 0 | 35 | 100.0 |
| id | 35 | 35 | BIOSAMPLE:SAMEA4083307 | 1 | 100.0 |


## Describe env_packages labeled as 'Coastal water body (ENVO:02000049)'

In [43]:
coastalWaterDf = getReducedDatame(reducedDf[reducedDf.norm_env_package == 'Coastal water body (ENVO:02000049)'].copy()) # note: a copy of the dataframe is being used
coastalWaterDescribe = getDescibeDf(coastalWaterDf)

In [44]:
coastalWaterDf.shape

(1, 19)

In [45]:
coastalWaterDescribe

| | count | unique | top | freq | pct_non_null |
| --- | --- | --- | --- | --- | --- |
| rel_to_oxygen | 1 | 1 | aerobe | 1 | 100.0 |
| env_package | 1 | 1 | Coastal water body (ENVO:02000049) | 1 | 100.0 |
| propagation | 1 | 1 | binary fission | 1 | 100.0 |
| samp_mat_process | 1 | 1 | Stored in the dark at 4°C until cyst germin... | 1 | 100.0 |
| geo_loc_name | 1 | 1 | South Korea: Shiwha | 1 | 100.0 |
| biotic_relationship | 1 | 1 | free living | 1 | 100.0 |
| collection_date | 1 | 1 | Sep-2010 | 1 | 100.0 |
| temp | 1 | 1 | 21.3 C | 1 | 100.0 |
| env_broad_scale | 1 | 1 | Sea coast (ENVO:00000303) | 1 | 100.0 |
| env_local_scale | 1 | 1 | Sea floor (ENVO:00000482) | 1 | 100.0 |
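
The per-package sections above all follow the same pattern: subset by `norm_env_package`, drop empty columns, then describe. As a rough sketch under assumptions, that pattern could be driven by a `groupby` loop. The toy dataframe and the name `describeByPackage` are hypothetical, and `dropna` stands in for the notebook's helper functions.

```python
import numpy as np
import pandas as pds

toyDf = pds.DataFrame({
    'norm_env_package': ['water', 'water', 'sea water'],
    'depth': ['5', '10', np.nan],
    'temp': [np.nan, np.nan, '21.3 C'],
})

describeByPackage = {}
for label, group in toyDf.groupby('norm_env_package'):
    reduced = group.dropna(axis=1, how='all')       # drop columns with no data for this package
    describeDf = reduced.describe().transpose()     # one row per remaining column
    describeDf['pct_non_null'] = (describeDf['count'] / len(reduced) * 100).astype(float).round(2)
    describeByPackage[label] = describeDf

print(describeByPackage['water'])
```

Keeping the results in a dictionary keyed by label makes it easy to look up or export one package at a time.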
