# Correlations between transcriptome and microbiome

I (RACS) am using this second notebook to improve how identifiers are associated between the 16S and the expression data, and to better organize the downstream analyses. The analyses here supersede all analyses in the older notebook.

## Datasets

### RNAseq data

 * Matrix of TPM values obtained by quantifying the cleaned reads with Salmon against the maize transcriptome (representative transcripts of maize version 5)

### 16S data

Three matrices will be used:

 * Feature table with ASV counts, after generating ASVs with DADA2 and postprocessing with QIIME 2
 * Feature table with phylotypes, after collapsing ASVs into phylotypes with MaLiAmPi
 * Feature table with phylotypes, after collapsing ASVs into phylotypes from the phylogenetic placement with PICRUSt2 (EPA-NG)

I (RACS) had more control over the steps of the DADA2/EPA-NG PICRUSt2 pipeline.

# Associations between RNAseq and 16S



Using `0_kremling_expression_key.txt` from [Dr. Wallace's FigShare](https://doi.org/10.6084/m9.figshare.5886769.v2) to map SRA runs between the two datasets (RNAseq and 16S metataxonomics).

In [1]:
kremling_expression_key = '/home/rsantos/Repositories/maize_microbiome_transcriptomics/correlations_rnaseq_metataxonomics/0_kremling_expression_key.txt'
sra_run_table_16s = '/home/rsantos/Repositories/maize_microbiome_transcriptomics/16S_wallace2018/SraRunInfo_Wallace_etal_2018.csv'
sra_run_table_rnaseq = '/home/rsantos/Repositories/maize_microbiome_transcriptomics/rnaseq_kremling2018/run_info/SraRunInfo_Kremling_etal_2018.csv'

dict_wallace_kremling_2018 = {}
kremling_expression_key_dict = {}

In [2]:
with open(kremling_expression_key, 'r') as file:

    _ = file.readline()

    for line in file:
        fields = line.strip().split('\t')
        
        kremling_identifier = fields[0]
        wallace_identifier = fields[1]

        kremling_expression_key_dict[kremling_identifier] = wallace_identifier

In [3]:
import re

with open(sra_run_table_rnaseq, 'r') as file:

    _ = file.readline()

    for line in file:
        fields = line.strip().split(',')
        fields2 = fields[11].split('_')
        rnaseq_run_id = fields[0]
        sample_id = fields2[1]
        rnaseq_genotype = fields2[2]
        day = ''
        match = re.search(r'\d+', sample_id)
        unmatched_parts = re.split(r'\d+', sample_id)
        day_period = unmatched_parts[0]
        if match:
            day = int(match.group())
        if sample_id.startswith('LMA') and rnaseq_genotype != '#N/A':
            dict_wallace_kremling_2018[fields[11]] = {'run_accession_16s': '',
                                    'run_accession_rnaseq': rnaseq_run_id,
                                    'day': day,
                                    'day_period': day_period,
                                    'genotype_16s': '',
                                    'genotype_rnaseq': rnaseq_genotype}
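
The RNAseq library names handled above appear to follow the pattern `<plate>_<day period><day>_<genotype>_<barcode>` (for example `10343927_LMAN8_B73_CACACT`). As a minimal sketch, assuming the sample field is always letters followed by digits, the day period and the day can be extracted with a single anchored regular expression. The names `SAMPLE_PATTERN` and `parse_sample_id` are hypothetical helpers and are not part of the notebook.

```python
import re

# Hypothetical helper: letters (day period) followed by digits (day)
SAMPLE_PATTERN = re.compile(r'^(?P<day_period>[A-Za-z]+)(?P<day>\d+)$')


def parse_sample_id(sample_id):
    """Return (day_period, day) or None when the field does not match."""
    match = SAMPLE_PATTERN.match(sample_id)
    if match is None:
        return None
    return match.group('day_period'), int(match.group('day'))


# The sample field is the second underscore-separated element of the name
library_name = '10343927_LMAN8_B73_CACACT'
parsed = parse_sample_id(library_name.split('_')[1])
```

Returning `None` for a field that does not match makes malformed names explicit instead of silently storing an empty day.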

In [4]:
rnaseq_samples_with_16s = 0

with open(sra_run_table_16s, 'r') as file:

    _ = file.readline()

    for line in file:
        fields = line.strip().split(',')
        fields2 = fields[11].split('.')
        metataxonomics_run_id = fields[0]
        day = int(fields2[1])
        day_period = fields2[0]
        for key, value in kremling_expression_key_dict.items():
            if value == fields[11]:
                if dict_wallace_kremling_2018[key]['day'] != day:
                    print('Big problem!')
                    print(day, dict_wallace_kremling_2018[key]['day'])
                    print(dict_wallace_kremling_2018[key])
                    print(value, fields[11], key)
                    exit(1)
                if dict_wallace_kremling_2018[key]['day_period'] != day_period:
                    print('Big problem!')
                    if key == '10343927_LMAN8_CML505_CAACAG':
                        #print("It's ok. I know this sample is problematic.")
                        continue
                    else:
                        print(day_period, dict_wallace_kremling_2018[key]['day_period'])
                        print(dict_wallace_kremling_2018[key])
                        print(value, fields[11], key)
                        exit(1)
                dict_wallace_kremling_2018[key]['run_accession_16s'] = metataxonomics_run_id
                rnaseq_samples_with_16s+=1

print(f'{rnaseq_samples_with_16s} sample pairs found.')

Big problem!
484 sample pairs found.
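
The cell above scans the whole expression key for every 16S run. A possible alternative, sketched here on toy data, is to invert the key once so that each 16S identifier maps directly to its RNAseq libraries. All identifiers below are hypothetical, and a list is used as the value because more than one RNAseq library may point to the same 16S sample.

```python
from collections import defaultdict

# Hypothetical key: RNAseq library name -> 16S sample identifier
toy_expression_key = {
    'rnaseq_lib_1': 'LMAN.8.0001',
    'rnaseq_lib_2': 'LMAD.26.0002',
    'rnaseq_lib_3': 'LMAN.8.0001',
}

# Invert the mapping once: 16S sample identifier -> list of RNAseq libraries
rnaseq_ids_by_16s_id = defaultdict(list)
for rnaseq_id, id_16s in toy_expression_key.items():
    rnaseq_ids_by_16s_id[id_16s].append(rnaseq_id)

# Direct lookup; .get() avoids creating entries for unknown identifiers
matches = rnaseq_ids_by_16s_id.get('LMAN.8.0001', [])
missing = rnaseq_ids_by_16s_id.get('LMAN.8.9999', [])
```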


In [5]:
no_16s = 0
for key, value in dict_wallace_kremling_2018.items():
    if value['run_accession_16s'] == '':
        print(key, value)
        no_16s+=1
print(f'{no_16s} samples without 16S data.')

10343927_LMAN8_B73_CACACT {'run_accession_16s': '', 'run_accession_rnaseq': 'SRR5909633', 'day': 8, 'day_period': 'LMAN', 'genotype_16s': '', 'genotype_rnaseq': 'B73'}
10343927_LMAN8_CML505_CAACAG {'run_accession_16s': '', 'run_accession_rnaseq': 'SRR5911345', 'day': 8, 'day_period': 'LMAN', 'genotype_16s': '', 'genotype_rnaseq': 'CML505'}
2 samples without 16S data.


# Generating a matrix with both RNAseq and Metataxonomic data for the ASV data

Associations between 16S and RNAseq data are present in the 'dict_wallace_kremling_2018' dictionary.

In [6]:
run2my_sample_id = {}

for key in dict_wallace_kremling_2018:
    if dict_wallace_kremling_2018[key]['run_accession_rnaseq']:
        run2my_sample_id[dict_wallace_kremling_2018[key]['run_accession_rnaseq']] = key
    if dict_wallace_kremling_2018[key]['run_accession_16s']:
        run2my_sample_id[dict_wallace_kremling_2018[key]['run_accession_16s']] = key

In [7]:
import pandas as pd

# Importing expression data from Kremling et al. 2018 (TPM matrix on Maize v5 using Salmon after cleaning with cutadapt)
kremling_expression_v5 = pd.read_csv('/media/rsantos/4TB_drive/Projects/UGA_RACS/RNAseq/Salmon/Zma2_tpm_matrix.txt', sep='\t')

# Use the 'Name' column (transcript identifiers) as the index
kremling_expression_v5.set_index('Name', inplace=True)

# Print the dataframe
kremling_expression_v5.head()

Name,SRR5909626,SRR5909627,SRR5909633,SRR5909635,SRR5909639,SRR5909642,SRR5909645,SRR5909653,SRR5909655,SRR5909665,...,SRR5912073,SRR5912081,SRR5912082,SRR5912083,SRR5912093,SRR5912094,SRR5912104,SRR5912105,SRR5912111,SRR5912116
Zm00001eb371370_T002,1.04145,0.0,3.39106,0.0,0.0,1.82712,0.284514,2.23201,0.437147,0.468934,...,0.0,1.51042,0.0,0.0,0.0,2.82055,3.96967,0.0,2.96105,0.0
Zm00001eb371350_T001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zm00001eb371330_T001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zm00001eb371310_T001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zm00001eb371280_T001,1.2765,2.1092,0.692731,0.0,4.2798,1.47496,2.55732,0.0,1.06594,1.14953,...,3.02253,0.4114,1.17447,0.0,3.48749,9.47506,6.19189,3.80776,1.03695,1.14981


#### Renaming columns

Renaming the columns of the Kremling data based on the associations in `run2my_sample_id`.
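
As a small illustration of how this renaming behaves, the sketch below uses toy data with hypothetical run and sample names. `DataFrame.rename` leaves columns that are absent from the mapping untouched, which is why runs without a match keep their SRR accession as the column name.

```python
import pandas as pd

# Hypothetical mapping and matrix, for illustration only
toy_run2sample = {'SRR0000001': 'sample_A', 'SRR0000002': 'sample_B'}
toy_matrix = pd.DataFrame(
    [[1, 2, 3]],
    columns=['SRR0000001', 'SRR0000002', 'SRR0000003'],
    index=['feature_1'],
)

# Columns missing from the mapping keep their original name
renamed = toy_matrix.rename(columns=toy_run2sample)

# Listing the unmapped runs makes the leftovers explicit
unmapped = [run for run in toy_matrix.columns if run not in toy_run2sample]
```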


In [8]:
# Rename the columns using the dictionary
kremling_expression_v5 = kremling_expression_v5.rename(columns=run2my_sample_id)
kremling_expression_v5.columns = [str(x) for x in kremling_expression_v5.columns]

kremling_expression_v5.head()

Name,10343927_LMAD26_CI21E_AAGTGG,10343264_LMAN26_CI21E_ATGAAC,10343927_LMAN8_B73_CACACT,10343264_LMAN26_B64_ACCAGT,10343262_LMAN8_B109_TGCTAT,10343262_LMAN8_B14A_CTCTCG,10343262_LMAN8_B57_CCTAAG,10343927_LMAD26_B77_TAATCG,10343262_LMAN8_B79_GCAGCC,10343927_LMAN8_CI187-2_GACGAT,...,10344826_LMAN8_I29_ACGTCT,10344823_LMAD8_IA2132_ACACGC,10343264_LMAD26_CML91_AACGCC,10344827_LMAN26_CML91_AATCCG,10344827_LMAN26_Ki21_AAGACA,10343927_LMAD26_Ki21_ACGTCT,10344826_LMAD8_E2558W_CGCAAC,10343927_LMAN8_E2558W_GAACCT,10344826_LMAD8_IDS69_CAGGAC,10343927_LMAN8_IDS69_ACATTA
Zm00001eb371370_T002,1.04145,0.0,3.39106,0.0,0.0,1.82712,0.284514,2.23201,0.437147,0.468934,...,0.0,1.51042,0.0,0.0,0.0,2.82055,3.96967,0.0,2.96105,0.0
Zm00001eb371350_T001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zm00001eb371330_T001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zm00001eb371310_T001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zm00001eb371280_T001,1.2765,2.1092,0.692731,0.0,4.2798,1.47496,2.55732,0.0,1.06594,1.14953,...,3.02253,0.4114,1.17447,0.0,3.48749,9.47506,6.19189,3.80776,1.03695,1.14981


In [9]:
# Importing ASV data; generated from processing 16S data from Wallace et al. (2018)
wallace_asvs = pd.read_csv('/media/rsantos/4TB_drive/Projects/UGA_RACS/16S/Qiime2/dada2/as_single_q20/table-paired-end_wallace2018_assingle_forward_q20-dada2_feature-table/q20_fw_feature-table.tsv',
                           sep='\t')

# Rename the 'ASV' column to 'Name' and use it as the index
wallace_asvs.rename(columns={'ASV': 'Name'}, inplace=True)
wallace_asvs.set_index('Name', inplace=True)

# Print the dataframe
wallace_asvs.head()

Name,SRR6665476,SRR6665477,SRR6665478,SRR6665479,SRR6665480,SRR6665481,SRR6665482,SRR6665483,SRR6665484,SRR6665485,...,SRR6666058,SRR6666059,SRR6666060,SRR6666061,SRR6666062,SRR6666063,SRR6666064,SRR6666065,SRR6666066,SRR6666067
bc664ea528899e36452dd37c1f55a48a,47869.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,78028.0,0.0,0.0,0.0,0.0,0.0,58946.0,3868.0,0.0
232ad9e267688a5d573112b4855bac96,0.0,2727.0,4065.0,27528.0,7244.0,3035.0,2433.0,847.0,2351.0,830.0,...,18215.0,0.0,9866.0,13921.0,29850.0,1713.0,11708.0,0.0,0.0,3469.0
6967c9a10eff11f751218e759df28ab7,0.0,610.0,4147.0,267.0,6479.0,7206.0,6862.0,7565.0,12271.0,1298.0,...,742.0,0.0,12448.0,225.0,7830.0,47503.0,1241.0,0.0,0.0,920.0
fa79d5937f424b58a27843dfff8bdcd4,0.0,1837.0,2993.0,18227.0,4525.0,1975.0,1701.0,545.0,1492.0,567.0,...,12277.0,0.0,6519.0,8965.0,22337.0,1113.0,7897.0,0.0,0.0,2223.0
e6b96dce8fbd261b8836b93b9a1d5e07,0.0,1767.0,3093.0,19988.0,5185.0,2100.0,1768.0,542.0,1485.0,603.0,...,11994.0,0.0,6479.0,9388.0,25538.0,1041.0,7923.0,0.0,0.0,2095.0


Renaming the columns of the Wallace data based on the associations in `run2my_sample_id`.

In [10]:
# Rename the columns using the dictionary
wallace_asvs = wallace_asvs.rename(columns=run2my_sample_id)
wallace_asvs.columns = [str(x) for x in wallace_asvs.columns]

In [11]:
wallace_asvs.head()

Name,SRR6665476,10343264_LMAN26_B73_GTGTAG,10343264_LMAN26_NC262_ACAGAT,10343264_LMAN26_CML10_AGACCA,10343264_LMAN26_NC314_ACGTCT,10343264_LMAN26_B46_ACCGTG,10343264_LMAN26_B84_GTGCCA,10343264_LMAN26_B73_ACTCTT,10343264_LMAN26_B77_GTAGAA,10344826_LMAN8_F7_GGCTGC,...,SRR6666058,SRR6666059,SRR6666060,SRR6666061,10344827_LMAN26_I137TN_ACATTA,10343264_LMAN26_CI64_ATATCC,10343927_LMAD26_CML154Q_ACAGAT,10343927_LMAD26_T234_GTCAGG,10343927_LMAD26_NC344_AATGAA,10343927_LMAD26_K64_CCTGCT
bc664ea528899e36452dd37c1f55a48a,47869.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,78028.0,0.0,0.0,0.0,0.0,0.0,58946.0,3868.0,0.0
232ad9e267688a5d573112b4855bac96,0.0,2727.0,4065.0,27528.0,7244.0,3035.0,2433.0,847.0,2351.0,830.0,...,18215.0,0.0,9866.0,13921.0,29850.0,1713.0,11708.0,0.0,0.0,3469.0
6967c9a10eff11f751218e759df28ab7,0.0,610.0,4147.0,267.0,6479.0,7206.0,6862.0,7565.0,12271.0,1298.0,...,742.0,0.0,12448.0,225.0,7830.0,47503.0,1241.0,0.0,0.0,920.0
fa79d5937f424b58a27843dfff8bdcd4,0.0,1837.0,2993.0,18227.0,4525.0,1975.0,1701.0,545.0,1492.0,567.0,...,12277.0,0.0,6519.0,8965.0,22337.0,1113.0,7897.0,0.0,0.0,2223.0
e6b96dce8fbd261b8836b93b9a1d5e07,0.0,1767.0,3093.0,19988.0,5185.0,2100.0,1768.0,542.0,1485.0,603.0,...,11994.0,0.0,6479.0,9388.0,25538.0,1041.0,7923.0,0.0,0.0,2095.0


#### Ensuring Wallace df has the same columns as Kremling df

In [12]:
kremling_expression_v5 = kremling_expression_v5.filter(items=wallace_asvs.columns)

In [13]:
wallace_asvs = wallace_asvs.filter(items=kremling_expression_v5.columns)
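
The two `filter` calls above restrict both matrices to their shared sample columns. The same idea can be sketched on toy data with an explicit list of shared columns, which makes the resulting column order obvious (it follows the first dataframe). The toy frames and sample names are hypothetical.

```python
import pandas as pd

# Hypothetical matrices with partially overlapping sample columns
toy_expression = pd.DataFrame({'s1': [1.0], 's2': [2.0], 's3': [3.0]},
                              index=['gene_x'])
toy_asvs = pd.DataFrame({'s3': [30], 's1': [10], 's9': [90]},
                        index=['asv_a'])

# Shared columns, in the order in which they appear in toy_expression
shared = [col for col in toy_expression.columns if col in toy_asvs.columns]

toy_expression = toy_expression[shared]
toy_asvs = toy_asvs[shared]

# Index.equals checks that labels and order are identical
same_columns = toy_expression.columns.equals(toy_asvs.columns)
```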

In [14]:
print(kremling_expression_v5.shape)
kremling_expression_v5.head()

(39096, 482)


Name,10343264_LMAN26_B73_GTGTAG,10343264_LMAN26_NC262_ACAGAT,10343264_LMAN26_CML10_AGACCA,10343264_LMAN26_NC314_ACGTCT,10343264_LMAN26_B46_ACCGTG,10343264_LMAN26_B84_GTGCCA,10343264_LMAN26_B73_ACTCTT,10343264_LMAN26_B77_GTAGAA,10344826_LMAN8_F7_GGCTGC,10344826_LMAN8_ND246_CGTCGC,...,10344826_LMAD8_NC358_GCAGCC,10344826_LMAD8_NC294_CGATCT,10344826_LMAD8_K55_AAGACA,10344827_LMAN26_B73_GAACCT,10344827_LMAN26_I137TN_ACATTA,10343264_LMAN26_CI64_ATATCC,10343927_LMAD26_CML154Q_ACAGAT,10343927_LMAD26_T234_GTCAGG,10343927_LMAD26_NC344_AATGAA,10343927_LMAD26_K64_CCTGCT
Zm00001eb371370_T002,1.34982,0.0,0.0,0.0,0.0,0.359664,0.541724,0.0,0.0,0.0,...,2.11575,1.24155,1.24528,0.0,0.0,0.0,0.808996,0.74369,3.4292,0.0
Zm00001eb371350_T001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zm00001eb371330_T001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zm00001eb371310_T001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zm00001eb371280_T001,0.441188,0.0,0.215403,0.0,0.817563,2.35113,0.885315,0.517418,1.56597,0.0,...,2.99848,2.70535,2.00021,2.23636,0.0,2.35318,2.97472,0.911501,1.52984,4.15375


In [15]:
print(wallace_asvs.shape)
wallace_asvs.head()

(6241, 482)


Name,10343264_LMAN26_B73_GTGTAG,10343264_LMAN26_NC262_ACAGAT,10343264_LMAN26_CML10_AGACCA,10343264_LMAN26_NC314_ACGTCT,10343264_LMAN26_B46_ACCGTG,10343264_LMAN26_B84_GTGCCA,10343264_LMAN26_B73_ACTCTT,10343264_LMAN26_B77_GTAGAA,10344826_LMAN8_F7_GGCTGC,10344826_LMAN8_ND246_CGTCGC,...,10344826_LMAD8_NC358_GCAGCC,10344826_LMAD8_NC294_CGATCT,10344826_LMAD8_K55_AAGACA,10344827_LMAN26_B73_GAACCT,10344827_LMAN26_I137TN_ACATTA,10343264_LMAN26_CI64_ATATCC,10343927_LMAD26_CML154Q_ACAGAT,10343927_LMAD26_T234_GTCAGG,10343927_LMAD26_NC344_AATGAA,10343927_LMAD26_K64_CCTGCT
bc664ea528899e36452dd37c1f55a48a,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,156104.0,91111.0,90631.0,0.0,0.0,0.0,0.0,58946.0,3868.0,0.0
232ad9e267688a5d573112b4855bac96,2727.0,4065.0,27528.0,7244.0,3035.0,2433.0,847.0,2351.0,830.0,141.0,...,0.0,0.0,0.0,423.0,29850.0,1713.0,11708.0,0.0,0.0,3469.0
6967c9a10eff11f751218e759df28ab7,610.0,4147.0,267.0,6479.0,7206.0,6862.0,7565.0,12271.0,1298.0,284.0,...,0.0,0.0,0.0,1682.0,7830.0,47503.0,1241.0,0.0,0.0,920.0
fa79d5937f424b58a27843dfff8bdcd4,1837.0,2993.0,18227.0,4525.0,1975.0,1701.0,545.0,1492.0,567.0,92.0,...,0.0,0.0,0.0,280.0,22337.0,1113.0,7897.0,0.0,0.0,2223.0
e6b96dce8fbd261b8836b93b9a1d5e07,1767.0,3093.0,19988.0,5185.0,2100.0,1768.0,542.0,1485.0,603.0,88.0,...,0.0,0.0,0.0,308.0,25538.0,1041.0,7923.0,0.0,0.0,2095.0


In [16]:
# Index.equals compares the column labels element by element, in order
if wallace_asvs.columns.equals(kremling_expression_v5.columns):
    print('Columns are equal!')

Columns are equal!


# Filtering out ASVs and genes with many zeros

In [17]:
def count_zeros(df, threshold=0.5):
    # Keep only the rows (features) whose number of zeros is below
    # threshold * number of columns; all other rows are removed
    threshold_int = int(df.shape[1] * threshold)
    print(f'Threshold: {threshold_int} (threshold * number of columns)')
    zero_counts = (df == 0).sum(axis=1)  # number of zeros in each row
    return df[zero_counts < threshold_int]

In [18]:
wallace_asvs_zeros_filtered = count_zeros(wallace_asvs)
kremling_expression_v5_zeros_filtered = count_zeros(kremling_expression_v5)

Threshold: 241 (threshold * number of columns)
Threshold: 241 (threshold * number of columns)


In [19]:
print(wallace_asvs_zeros_filtered.shape)
print(kremling_expression_v5_zeros_filtered.shape)

(13, 482)
(19953, 482)
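
To make the effect of the filter concrete, here is a toy sketch that expresses the same rule as a fraction of zeros per row. The data are hypothetical. Note that `count_zeros` truncates the threshold to an integer, so with an odd number of columns the two formulations can differ for rows that sit right at the boundary.

```python
import pandas as pd

# Hypothetical count table: rows are features, columns are samples
toy_counts = pd.DataFrame(
    {'s1': [0, 5, 0], 's2': [0, 3, 1], 's3': [0, 0, 2], 's4': [7, 1, 0]},
    index=['asv_a', 'asv_b', 'asv_c'],
)

# Fraction of samples in which each feature is zero
zero_fraction = (toy_counts == 0).mean(axis=1)

# Keep features that are zero in fewer than half of the samples
kept = toy_counts[zero_fraction < 0.5]
```

Here `asv_a` (zero in 3 of 4 samples) and `asv_c` (zero in exactly half) are removed, and only `asv_b` is kept.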


# Concatenating matrices and computing correlations for the ASVs


In [20]:
concatenated_df = pd.concat([wallace_asvs_zeros_filtered, kremling_expression_v5_zeros_filtered], axis=0)

In [21]:
print(wallace_asvs_zeros_filtered.shape)
print(kremling_expression_v5_zeros_filtered.shape)
print(concatenated_df.shape)

(13, 482)
(19953, 482)
(19966, 482)


In [22]:
concatenated_transposed = concatenated_df.T

At the time of writing, there are at least two interesting approaches for computing correlations on a matrix of this size:

 * DeepGraph
 * CorALS

I (RACS) decided to start with CorALS.
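
For orientation only, the quantity of interest (the correlation of every ASV with every gene) can be sketched with plain NumPy on toy data. This is not how CorALS is implemented; it is a minimal illustration of a cross-correlation between two blocks of features, and the function name `cross_correlation` and the toy arrays are assumptions made for this example.

```python
import numpy as np


def cross_correlation(a, b):
    """Pearson correlation between every column of a and every column of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Center each column, then scale it to unit length
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    a /= np.linalg.norm(a, axis=0)
    b /= np.linalg.norm(b, axis=0)
    # Dot products of unit-length centered columns are Pearson correlations
    return a.T @ b


# Toy data: 20 samples (rows), 3 "ASVs" and 5 "genes" (columns)
rng = np.random.default_rng(0)
toy_asvs = rng.poisson(50, size=(20, 3))
toy_genes = rng.gamma(2.0, 2.0, size=(20, 5))

toy_cor = cross_correlation(toy_asvs, toy_genes)
```

Computing only the ASV-by-gene block gives a much smaller matrix than the full feature-by-feature matrix, since the gene-gene correlations are never needed here.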

In [23]:
from corals.threads import set_threads_for_external_libraries
set_threads_for_external_libraries(n_threads=1)
import numpy as np
from corals.correlation.full.default import cor_full



In [24]:
cor_values = cor_full(concatenated_transposed)
cor_values.shape

(19966, 19966)

In [25]:
# Collect ASV-gene pairs with |r| > 0.6 (gene-gene and ASV-ASV pairs are skipped)
features = cor_values.columns
highly_correlated_pairs = []

with open('correlated_pairs_v2.txt', 'w') as correlated_pairs_file:
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            # Exactly one feature of the pair must be a maize transcript ('Zm' prefix)
            if features[i].startswith('Zm') == features[j].startswith('Zm'):
                continue
            correlation = cor_values.iloc[i, j]
            # abs() already covers strong negative correlations
            if abs(correlation) > 0.6:
                highly_correlated_pairs.append([features[i], features[j], correlation])
                correlated_pairs_file.write(f'{features[i]}\t{features[j]}\t{correlation}\n')

highly_correlated_pairs_df = pd.DataFrame(highly_correlated_pairs,
                                          columns=['feature1', 'feature2', 'correlation'])

# Calculating p-values and corrected p-values

In [60]:
from corals.correlation.topk.default import cor_topk
cor_topk_values, cor_topk_coo = cor_topk(concatenated_transposed, correlation_type="spearman", k=0.001, n_jobs=8)

from corals.correlation.utils import derive_pvalues, multiple_test_correction
n_samples = concatenated_transposed.shape[0]
n_features = concatenated_transposed.shape[1]

# calculate p-values
pvalues = derive_pvalues(cor_topk_values, n_samples)

# multiple hypothesis correction
pvalues_corrected = multiple_test_correction(pvalues, n_features, method="fdr_bh")

  ts = rf * rf * (df / (1 - rf * rf))
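
As background for the `fdr_bh` method used above, the Benjamini-Hochberg adjustment can be sketched in a few lines of NumPy. This is an illustrative re-implementation on toy p-values, not the CorALS code. The optional `n_tests` argument is an assumption of this sketch: it covers the case where only the smallest p-values of a larger family of tests are available, as with a top-k selection.

```python
import numpy as np


def benjamini_hochberg(pvalues, n_tests=None):
    """Return Benjamini-Hochberg adjusted p-values.

    n_tests is the total number of tests performed; it defaults to
    len(pvalues). Passing a larger value assumes that pvalues are the
    smallest p-values of that larger family.
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size if n_tests is None else n_tests
    order = np.argsort(p)
    # p_(i) * m / i for the i-th smallest p-value
    ranked = p[order] * m / np.arange(1, p.size + 1)
    # Enforce monotonicity, starting from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty_like(ranked)
    adjusted[order] = np.clip(ranked, 0, 1)
    return adjusted


toy_pvalues = [0.01, 0.04, 0.03, 0.20]
toy_adjusted = benjamini_hochberg(toy_pvalues)
```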


In [65]:
pvalues_corrected

array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
       6.89406216e-43, 6.89785885e-43, 6.89980704e-43])

In [62]:
pvalues

array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
       6.89404217e-46, 6.89785615e-46, 6.89982164e-46])

In [59]:
concatenated_transposed.head()

Name,232ad9e267688a5d573112b4855bac96,6967c9a10eff11f751218e759df28ab7,fa79d5937f424b58a27843dfff8bdcd4,e6b96dce8fbd261b8836b93b9a1d5e07,1674323e4fe615dc003edd628305bc9f,6935437446b9c69c21f6ac4518b2eb04,8db37fcbc11f63d8a46690adeb7cad70,52c0751a4259810b7c12be45c6597335,e588c7fc94221bf561c85d536b986ef9,d8fac1aa74436b8041e29a3237da7955,...,Zm00001eb113400_T001,Zm00001eb398410_T001,Zm00001eb282510_T001,Zm00001eb011050_T001,Zm00001eb068130_T003,Zm00001eb161390_T001,Zm00001eb111420_T001,Zm00001eb363130_T001,Zm00001eb046990_T001,Zm00001eb282660_T002
10343264_LMAN26_B73_GTGTAG,2727.0,610.0,1837.0,1767.0,2086.0,394.0,390.0,1402.0,1351.0,385.0,...,2.22215,0.931736,0.409951,11.9725,11.7956,0.990159,0.0,48.1324,1.81587,16.6081
10343264_LMAN26_NC262_ACAGAT,4065.0,4147.0,2993.0,3093.0,7930.0,2683.0,2692.0,5340.0,5537.0,4133.0,...,0.0,1.98341,1.24667,2.67712,11.5257,37.873,0.0,52.6972,0.0,30.7245
10343264_LMAN26_CML10_AGACCA,27528.0,267.0,18227.0,19988.0,13604.0,205.0,211.0,9368.0,10010.0,384.0,...,0.0,0.454905,3.00228,7.99444,6.98635,19.7536,0.190885,22.8303,0.688831,28.8076
10343264_LMAN26_NC314_ACGTCT,7244.0,6479.0,4525.0,5185.0,19280.0,4397.0,4618.0,13200.0,13543.0,4708.0,...,0.0,0.539022,1.42297,4.58356,10.7393,0.57282,0.0,41.7757,1.14365,16.436
10343264_LMAN26_B46_ACCGTG,3035.0,7206.0,1975.0,2100.0,8901.0,4712.0,4984.0,5807.0,6101.0,51518.0,...,0.40703,0.0,1.51936,3.75208,8.24168,0.917428,0.543377,44.2821,2.36366,30.2941


In [58]:
from corals.correlation.topk.default import cor_topk
cor_topk_values_pearson, cor_topk_coo_pearson = cor_topk(concatenated_transposed, correlation_type="pearson", k=0.001, n_jobs=8)

from corals.correlation.utils import derive_pvalues, multiple_test_correction
n_samples = concatenated_transposed.shape[0]
n_features = concatenated_transposed.shape[1]

# calculate p-values
pvalues_pearson = derive_pvalues(cor_topk_values_pearson, n_samples)

# multiple hypothesis correction
pvalues_corrected_pearson = multiple_test_correction(pvalues_pearson, n_features, method="fdr_bh")

InvalidIndexError: (slice(12480, 14976, None), slice(None, None, None))



# Computing correlations between RNAseq and Metataxonomic data after collapsing to phylotypes

Correlating gene expression with individual ASVs is limited by the sparsity of the ASV table: only 13 of the 6,241 ASVs passed the zero filter above. An alternative (and possibly better) solution is to collapse the ASVs generated with DADA2 into phylotypes. A promising way to do so is the approach of [Minot et al. (2023)](https://doi.org/10.1016/j.crmeth.2023.100639), who described the Nextflow workflow MaLiAmPi and the Python package "phylotypes".

I (RACS) installed phylotypes and its dependencies in a conda environment with Python 3.10, a version with which all of the dependencies can be installed.

Two files will be used to generate the matrix with counts by sample and phylotypes:

 * `dada2.sv.shared.txt`, which contains the SV counts in a TSV format similar to a mothur shared file (described [here](https://github.com/jgolob/maliampi))
 * `phylotypes_maliampi_q20_fw`, output of phylotypes, which is described [here](https://github.com/jgolob/phylotypes)
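
The way these two files combine can be sketched on toy data: each SV is assigned to one phylotype, and the counts of all SVs in the same phylotype are summed per sample. The identifiers and counts below are hypothetical and only mimic the layout of the real files.

```python
import pandas as pd

# Hypothetical SV counts: rows are SVs, columns are sequencing runs
toy_sv_counts = pd.DataFrame(
    {'run_1': [5, 0, 2], 'run_2': [1, 4, 0]},
    index=pd.Index(['sv-1', 'sv-2', 'sv-3'], name='sv'),
)

# Hypothetical phylotype membership: one row per SV
toy_membership = pd.DataFrame({
    'phylotype': ['pt__00001', 'pt__00001', 'pt__00002'],
    'sv': ['sv-1', 'sv-3', 'sv-2'],
})

# Attach the counts to each SV, then sum the counts within each phylotype
merged = toy_membership.merge(toy_sv_counts, left_on='sv',
                              right_index=True, how='inner')
toy_phylotype_counts = merged.drop(columns='sv').groupby('phylotype').sum()
```

With `how='inner'`, SVs that lack either a phylotype assignment or a count row are dropped before summing.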

In [26]:
dada2_sv_shared_df = pd.read_csv('/home/rsantos/Repositories/maize_microbiome_transcriptomics/16S_wallace2018/maliampi_phylotypes/dada2.sv.shared.txt', sep='\t')
dada2_sv_shared_df = dada2_sv_shared_df.rename(columns={'label': 'sv'}) # Rename the column to sv
dada2_sv_shared_df.drop(columns=['group', 'numsvs'], inplace=True) # Drop the columns group and numsvs
dada2_sv_shared_df.set_index('sv', inplace=True) # Set the index to sv
dada2_sv_shared_transposed_df = dada2_sv_shared_df.transpose() # Transpose the dataframe
dada2_sv_shared_transposed_df = dada2_sv_shared_transposed_df.rename_axis('sv') # Rename the index to sv
dada2_sv_shared_transposed_df.head()

sv,SRR6665481,SRR6665480,SRR6665490,SRR6665479,SRR6665489,SRR6665487,SRR6665478,SRR6665486,SRR6665483,SRR6665476,...,SRR6666061,SRR6666053,SRR6666059,SRR6666062,SRR6666064,SRR6666058,SRR6666066,SRR6666065,SRR6666067,SRR6666063
sv-1,0,0,0,0,43289,179248,0,0,0,66409,...,0,0,109404,0,0,0,5420,82828,0,0
sv-2,3096,7424,9923,28314,0,0,4212,154,844,0,...,14253,19142,0,30996,12060,18880,0,0,3607,1734
sv-3,2122,5338,6263,20682,0,0,3175,93,537,0,...,9614,13114,0,26409,8222,12309,0,0,2196,1041
sv-4,1988,4579,6633,18702,0,0,3066,94,535,0,...,9235,12811,0,23309,8105,12609,0,0,2320,1135
sv-5,7433,6608,4429,282,0,0,4296,306,7815,0,...,233,856,0,8086,1303,783,0,0,977,49259


In [27]:
phylotypes_maliampi_q20_fw_df = pd.read_csv('/home/rsantos/Repositories/maize_microbiome_transcriptomics/16S_wallace2018/maliampi_phylotypes/phylotypes_maliampi_q20_fw_1_0.txt')
phylotypes_maliampi_q20_fw_df[['sv_from_str',
                               'sample']] = phylotypes_maliampi_q20_fw_df['sv'].str.split(":",
                                                                                          regex=True,
                                                                                          expand=True) # Split the sv column into two columns
phylotypes_maliampi_q20_fw_df.drop(columns=['sv', 'sample'], inplace=True) # Drop the sv and sample columns
phylotypes_maliampi_q20_fw_df = phylotypes_maliampi_q20_fw_df.rename(columns={'sv_from_str': 'sv'}) # Rename the column to sv
phylotypes_maliampi_q20_fw_df.head()

Unnamed: 0,phylotype,sv
0,pt__00001,sv-3378
1,pt__00001,sv-6195
2,pt__00001,sv-1406
3,pt__00001,sv-4698
4,pt__00001,sv-580


In [28]:
phylotypes_counts_df = pd.merge(phylotypes_maliampi_q20_fw_df,
         dada2_sv_shared_transposed_df,
         on='sv',
         how='inner') # Merge the two dataframes on the sv column
phylotypes_counts_df.drop(columns=['sv'], inplace=True) # Drop the sv column
phylotypes_counts_df.set_index('phylotype', inplace=True) # Set the index to phylotype
phylotypes_counts_df.head()

phylotype,SRR6665481,SRR6665480,SRR6665490,SRR6665479,SRR6665489,SRR6665487,SRR6665478,SRR6665486,SRR6665483,SRR6665476,...,SRR6666061,SRR6666053,SRR6666059,SRR6666062,SRR6666064,SRR6666058,SRR6666066,SRR6666065,SRR6666067,SRR6666063
pt__00001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
pt__00001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
pt__00001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
pt__00001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
pt__00001,8,26,0,0,0,0,8,0,0,0,...,0,13,0,0,0,0,0,0,0,10


In [29]:
sum_by_group = phylotypes_counts_df.groupby('phylotype').sum()
sum_by_group.head()

phylotype,SRR6665481,SRR6665480,SRR6665490,SRR6665479,SRR6665489,SRR6665487,SRR6665478,SRR6665486,SRR6665483,SRR6665476,...,SRR6666061,SRR6666053,SRR6666059,SRR6666062,SRR6666064,SRR6666058,SRR6666066,SRR6666065,SRR6666067,SRR6666063
pt__00001,8122,5519,1854,48309,0,0,7870,673,3924,0,...,3091,18535,0,13633,7134,6358,0,0,3501,9797
pt__00002,171679,92936,56503,40244,0,0,75404,4138,6457,0,...,4068,7881,0,19807,4134,6247,0,0,5635,34287
pt__00003,36763,68495,40287,95309,0,0,41507,1589,24237,0,...,40750,57496,0,128947,39397,55763,0,0,27144,140189
pt__00004,60,28,55,14,0,0,0,18,53,0,...,293,526,0,152,22,241,0,0,36,99
pt__00005,3069,3830,271,28520,0,0,466,45,902,0,...,53804,8043,0,3010,1464,9920,0,0,721,1955


In [30]:
count_zeros(sum_by_group, threshold=0.5)

Threshold: 296 (threshold * number of columns)


phylotype,SRR6665481,SRR6665480,SRR6665490,SRR6665479,SRR6665489,SRR6665487,SRR6665478,SRR6665486,SRR6665483,SRR6665476,...,SRR6666061,SRR6666053,SRR6666059,SRR6666062,SRR6666064,SRR6666058,SRR6666066,SRR6666065,SRR6666067,SRR6666063
pt__00001,8122,5519,1854,48309,0,0,7870,673,3924,0,...,3091,18535,0,13633,7134,6358,0,0,3501,9797
pt__00002,171679,92936,56503,40244,0,0,75404,4138,6457,0,...,4068,7881,0,19807,4134,6247,0,0,5635,34287
pt__00003,36763,68495,40287,95309,0,0,41507,1589,24237,0,...,40750,57496,0,128947,39397,55763,0,0,27144,140189
pt__00004,60,28,55,14,0,0,0,18,53,0,...,293,526,0,152,22,241,0,0,36,99
pt__00005,3069,3830,271,28520,0,0,466,45,902,0,...,53804,8043,0,3010,1464,9920,0,0,721,1955
pt__00006,13194,19947,3620,321,0,0,21686,128,381,0,...,120,262,0,4236,2873,115,0,0,499,3072
pt__00007,638,4597,1853,17364,0,0,132,184,11553,0,...,6573,10621,0,549,600,7473,0,0,63662,87281
pt__00008,48,115,102,31,93,443,59,131,37,168,...,282,1373,41750,419,19,272,0,253,116,166
pt__00009,1869,299,63,38091,0,0,269,14,1010,0,...,22461,1542,0,186,1621,2091,0,0,242,1027
pt__00011,134,108,56,27,0,0,297,58,126,0,...,88,286,0,360,46,157,0,0,331,161


In [31]:
wallace_phylotypes_zeros_filtered = count_zeros(sum_by_group)
wallace_phylotypes_zeros_filtered.shape

Threshold: 296 (threshold * number of columns)


(16, 592)
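
The `count_zeros` helper is defined earlier in this notebook. Judging only from the printed message, the cutoff is `threshold * number of columns` (0.5 × 592 = 296 here), and 16 phylotypes pass it. A minimal sketch of a filter of that kind is shown below on toy data. The name `filter_rows_by_zero_count` is hypothetical, and the exact comparison (strictly fewer zeros than the cutoff) is an assumption, not necessarily what `count_zeros` does.

```python
import pandas as pd


def filter_rows_by_zero_count(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Keep rows whose number of zero entries is below threshold * n_columns.

    Hypothetical stand-in for the notebook's count_zeros helper; the
    comparison used there (< versus <=) is an assumption.
    """
    cutoff = threshold * df.shape[1]
    print(f'Threshold: {cutoff:g} (threshold * number of columns)')
    # Count zero entries per row (per phylotype)
    zeros_per_row = (df == 0).sum(axis=1)
    return df.loc[zeros_per_row < cutoff]


# Toy table: phylotypes in rows, samples in columns
toy = pd.DataFrame(
    {'s1': [5, 0, 3], 's2': [0, 0, 7], 's3': [2, 0, 1], 's4': [9, 1, 0]},
    index=['pt_a', 'pt_b', 'pt_c'],
)
toy_filtered = filter_rows_by_zero_count(toy, threshold=0.5)
```

In the toy table, `pt_b` is zero in three of four samples and is dropped, while the other two rows are kept.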

In [32]:
wallace_phylotypes_zeros_filtered = wallace_phylotypes_zeros_filtered.rename(columns=run2my_sample_id)
wallace_phylotypes_zeros_filtered.columns = [str(x) for x in wallace_phylotypes_zeros_filtered.columns]
wallace_phylotypes_zeros_filtered.head()

phylotype,10343264_LMAN26_B46_ACCGTG,10343264_LMAN26_NC314_ACGTCT,10344826_LMAN8_EP1_AATCCG,10343264_LMAN26_CML10_AGACCA,SRR6665489,10344826_LMAN8_A654_CTCCAT,10343264_LMAN26_NC262_ACAGAT,10344826_LMAN8_ND246_CGTCGC,10343264_LMAN26_B73_ACTCTT,SRR6665476,...,SRR6666061,SRR6666053,SRR6666059,10344827_LMAN26_I137TN_ACATTA,10343927_LMAD26_CML154Q_ACAGAT,SRR6666058,10343927_LMAD26_NC344_AATGAA,10343927_LMAD26_T234_GTCAGG,10343927_LMAD26_K64_CCTGCT,10343264_LMAN26_CI64_ATATCC
pt__00001,8122,5519,1854,48309,0,0,7870,673,3924,0,...,3091,18535,0,13633,7134,6358,0,0,3501,9797
pt__00002,171679,92936,56503,40244,0,0,75404,4138,6457,0,...,4068,7881,0,19807,4134,6247,0,0,5635,34287
pt__00003,36763,68495,40287,95309,0,0,41507,1589,24237,0,...,40750,57496,0,128947,39397,55763,0,0,27144,140189
pt__00004,60,28,55,14,0,0,0,18,53,0,...,293,526,0,152,22,241,0,0,36,99
pt__00005,3069,3830,271,28520,0,0,466,45,902,0,...,53804,8043,0,3010,1464,9920,0,0,721,1955


In [33]:
# Keep only the samples (columns) that are also present in the expression data
wallace_phylotypes_zeros_filtered = wallace_phylotypes_zeros_filtered.filter(items=kremling_expression_v5.columns)
wallace_phylotypes_zeros_filtered.head()

phylotype,10343264_LMAN26_B73_GTGTAG,10343264_LMAN26_NC262_ACAGAT,10343264_LMAN26_CML10_AGACCA,10343264_LMAN26_NC314_ACGTCT,10343264_LMAN26_B46_ACCGTG,10343264_LMAN26_B84_GTGCCA,10343264_LMAN26_B73_ACTCTT,10343264_LMAN26_B77_GTAGAA,10344826_LMAN8_F7_GGCTGC,10344826_LMAN8_ND246_CGTCGC,...,10344826_LMAD8_NC358_GCAGCC,10344826_LMAD8_NC294_CGATCT,10344826_LMAD8_K55_AAGACA,10344827_LMAN26_B73_GAACCT,10344827_LMAN26_I137TN_ACATTA,10343264_LMAN26_CI64_ATATCC,10343927_LMAD26_CML154Q_ACAGAT,10343927_LMAD26_T234_GTCAGG,10343927_LMAD26_NC344_AATGAA,10343927_LMAD26_K64_CCTGCT
pt__00001,1808,7870,48309,5519,8122,1509,3924,10951,3005,673,...,0,0,0,4744,13633,9797,7134,0,0,3501
pt__00002,4254,75404,40244,92936,171679,7588,6457,29174,74475,4138,...,0,0,0,4225,19807,34287,4134,0,0,5635
pt__00003,11082,41507,95309,68495,36763,30789,24237,52004,10162,1589,...,0,0,0,11134,128947,140189,39397,0,0,27144
pt__00004,51,0,14,28,60,0,53,21,107,18,...,0,0,0,59,152,99,22,0,0,36
pt__00005,59,466,28520,3830,3069,336,902,1512,257,45,...,0,0,0,571,3010,1955,1464,0,0,721


In [34]:
concatenated_phylotypes_df = pd.concat([wallace_phylotypes_zeros_filtered, kremling_expression_v5_zeros_filtered], axis=0)
print(kremling_expression_v5_zeros_filtered.shape)
print(wallace_phylotypes_zeros_filtered.shape)
print(concatenated_phylotypes_df.shape)

(19953, 482)
(16, 482)
(19969, 482)
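
`pd.concat(..., axis=0)` aligns columns by label and, with its default outer join, fills any column missing from one of the frames with NaN. The shapes printed above (482 columns in all three tables) indicate that the sample sets match here. A small sketch of making that check explicit, using toy frames with hypothetical names and values:

```python
import pandas as pd

# Toy stand-ins for the phylotype and expression tables (hypothetical values)
phylotypes = pd.DataFrame(
    {'sample_a': [10, 0], 'sample_b': [3, 8]},
    index=['pt__1', 'pt__2'],
)
expression = pd.DataFrame(
    {'sample_b': [1.5, 0.2], 'sample_a': [4.0, 7.1]},
    index=['Zm_gene1', 'Zm_gene2'],
)

# Both tables must describe the same samples, regardless of column order
assert set(phylotypes.columns) == set(expression.columns)

# Reorder the expression columns so both tables share one column order
expression = expression[phylotypes.columns]
combined = pd.concat([phylotypes, expression], axis=0)

# No NaN means every feature has a value for every sample
assert not combined.isna().any().any()
```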


In [35]:
concatenated_phylotypes_transposed = concatenated_phylotypes_df.T

In [36]:
# Thread limits are set before numpy is imported
from corals.threads import set_threads_for_external_libraries
set_threads_for_external_libraries(n_threads=1)
import numpy as np
from corals.correlation.full.default import cor_full

phylotypes_cor_values = cor_full(concatenated_phylotypes_transposed)
print(f'Finished all correlations with {phylotypes_cor_values.shape[0]} vs {phylotypes_cor_values.shape[1]}!')

# Gene identifiers start with 'Zm'; every other feature is a phylotype
features = phylotypes_cor_values.columns
is_gene = [str(name).startswith('Zm') for name in features]
highly_correlated_rows = []

# Find the highly correlated gene-phylotype pairs (|r| > 0.6)
with open('correlated_pairs_phylotypes_v2.txt', 'w') as correlated_pairs_phylotypes_file:
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            # Skip gene-gene and phylotype-phylotype pairs
            if is_gene[i] == is_gene[j]:
                continue
            correlation = phylotypes_cor_values.iloc[i, j]
            # abs() already covers both strong positive and strong negative correlations
            if abs(correlation) > 0.6:
                highly_correlated_rows.append([features[i], features[j], correlation])
                correlated_pairs_phylotypes_file.write(f'{features[i]}\t{features[j]}\t{correlation}\n')

highly_correlated_pairs_phylotypes_df = pd.DataFrame(highly_correlated_rows,
                                                     columns=['feature1', 'feature2', 'correlation'])



Finished all correlations with 19969 vs 19969!
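
The nested loop above visits every pair of the 19,969 features, although only gene-phylotype pairs are kept. A possible vectorized alternative, sketched here on toy data, slices the phylotype-by-gene block of the correlation matrix and filters it in one step. The toy matrix is built with `DataFrame.corr` rather than `cor_full` so that the example runs without corals. It assumes, as the loop does, that gene identifiers start with `Zm`, and additionally that the correlation matrix is a DataFrame labeled by feature name on both axes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=50)

# Samples in rows, features in columns (like concatenated_phylotypes_transposed)
toy = pd.DataFrame({
    'pt__1': base,
    'pt__2': rng.normal(size=50),
    'Zm_gene1': base + rng.normal(scale=0.1, size=50),   # tracks pt__1 closely
    'Zm_gene2': -base + rng.normal(scale=0.1, size=50),  # mirrors pt__1
    'Zm_gene3': rng.normal(size=50),
})
cor = toy.corr()  # Pearson correlation; stand-in for cor_full(...)

# Boolean mask over features: True for genes, False for phylotypes
is_gene = cor.columns.str.startswith('Zm')

# Rows = phylotypes, columns = genes; gene-gene and
# phylotype-phylotype pairs are never touched
block = cor.loc[~is_gene, is_gene]

# Long format: one row per (phylotype, gene) pair
pairs = (
    block.stack()
    .rename_axis(['feature1', 'feature2'])
    .reset_index(name='correlation')
)

# Keep strong positive and strong negative correlations
strong_pairs = pairs[pairs['correlation'].abs() > 0.6].reset_index(drop=True)
```

On the real data this would reduce the work to the 16 × 19,953 phylotype-gene block, and `strong_pairs.to_csv(path, sep='\t', header=False, index=False)` would write a three-column, tab-separated file like the one produced by the loop.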
