# Running correlations beteween gene expression and OTU abundance


## Filtering lowly expressed genes and lowly abundant OTUs

 * Remove genes with less than one mapped read per million reads in at least 80% of samples ([Johnson and Krishnan, 2022](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02568-9))
 * Retaining OTUs found at 0.001 relative abundance in at least 10% of the samples ([Priya et al 2022](https://www.nature.com/articles/s41564-022-01121-z))

### Filtering genes

Since RPKM will be used to filter out genes with low expression, it must be imported first:

In [2]:
import pandas as pd

kremling_expression_v5_day_rpkm = pd.read_csv('/home/rsantos/Repositories/maize_microbiome_transcriptomics/rnaseq_kremling2018/quantification/kremling_expression_v5_day_rpkm.tsv',
                            sep='\t')
kremling_expression_v5_day_rpkm.set_index('Name', inplace=True)

kremling_expression_v5_night_rpkm = pd.read_csv('/home/rsantos/Repositories/maize_microbiome_transcriptomics/rnaseq_kremling2018/quantification/kremling_expression_v5_night_rpkm.tsv',
                            sep='\t')
kremling_expression_v5_night_rpkm.set_index('Name', inplace=True)

These are the genes to be used in filtering steps:

In [3]:
genes_tokeep_day = kremling_expression_v5_day_rpkm[(kremling_expression_v5_day_rpkm > 1).sum(axis=1) >= (kremling_expression_v5_day_rpkm.shape[1] * 0.8)].index
genes_tokeep_night = kremling_expression_v5_night_rpkm[(kremling_expression_v5_night_rpkm > 1).sum(axis=1) >= (kremling_expression_v5_night_rpkm.shape[1] * 0.8)].index
print('Genes to keep in day:', len(genes_tokeep_day))
print('Genes to keep in night:', len(genes_tokeep_night))

Genes to keep in day: 13107
Genes to keep in night: 13630


Importing the TPM matrices and filter genes:

In [4]:
kremling_expression_v5_day_tpm = pd.read_csv('/home/rsantos/Repositories/maize_microbiome_transcriptomics/rnaseq_kremling2018/quantification/kremling_expression_v5_day_tpm.tsv',
                            sep='\t')
kremling_expression_v5_day_tpm.set_index('Name', inplace=True)
print(kremling_expression_v5_day_tpm.shape)

kremling_expression_v5_night_tpm = pd.read_csv('/home/rsantos/Repositories/maize_microbiome_transcriptomics/rnaseq_kremling2018/quantification/kremling_expression_v5_night_tpm.tsv',
                            sep='\t')
kremling_expression_v5_night_tpm.set_index('Name', inplace=True)
print(kremling_expression_v5_night_tpm.shape)

(39096, 176)
(39096, 228)


In [5]:
kremling_expression_v5_day_tpm_filtered = kremling_expression_v5_day_tpm[kremling_expression_v5_day_tpm.index.isin(genes_tokeep_day)]
kremling_expression_v5_night_tpm_filtered = kremling_expression_v5_night_tpm[kremling_expression_v5_night_tpm.index.isin(genes_tokeep_night)]

(13107, 176)
(13630, 228)


In [29]:
print(kremling_expression_v5_day_tpm_filtered.shape)
print(kremling_expression_v5_night_tpm_filtered.shape)

(13107, 176)
(13630, 228)


Importing the CPM matrices and filter genes:

In [27]:
kremling_expression_v5_day_cpm = pd.read_csv('/home/rsantos/Repositories/maize_microbiome_transcriptomics/rnaseq_kremling2018/quantification/kremling_expression_v5_day_cpm.tsv',
                            sep='\t')
kremling_expression_v5_day_cpm.set_index('Name', inplace=True)

kremling_expression_v5_night_cpm = pd.read_csv('/home/rsantos/Repositories/maize_microbiome_transcriptomics/rnaseq_kremling2018/quantification/kremling_expression_v5_night_cpm.tsv',
                            sep='\t')
kremling_expression_v5_night_cpm.set_index('Name', inplace=True)


In [28]:
kremling_expression_v5_day_cpm_filtered = kremling_expression_v5_day_cpm[kremling_expression_v5_day_cpm.index.isin(genes_tokeep_day)]
kremling_expression_v5_night_cpm_filtered = kremling_expression_v5_night_cpm[kremling_expression_v5_night_cpm.index.isin(genes_tokeep_night)]

In [30]:
print(kremling_expression_v5_day_cpm_filtered.shape)
print(kremling_expression_v5_night_cpm_filtered.shape)

(13107, 176)
(13630, 228)


### Filtering OTUs

Since relative abundance will be used to filter out genes with low expression, it must be imported first:

In [6]:
otu_table_merged_day_relative_abund = pd.read_csv('/home/rsantos/Repositories/maize_microbiome_transcriptomics/16S_wallace2018/combine_day_night_samples/summed_day_night_otu_day_relative_abund.tsv',
                            sep='\t')
otu_table_merged_day_relative_abund.set_index('OTU ID', inplace=True)

otu_table_merged_night_relative_abund = pd.read_csv('/home/rsantos/Repositories/maize_microbiome_transcriptomics/16S_wallace2018/combine_day_night_samples/summed_day_night_otu_night_relative_abund.tsv',
                            sep='\t')
otu_table_merged_night_relative_abund.set_index('OTU ID', inplace=True)

In [7]:
otus_tokeep_day = otu_table_merged_day_relative_abund[(otu_table_merged_day_relative_abund > 0.001).sum(axis=1) >= (otu_table_merged_day_relative_abund.shape[1] * 0.1)].index
otus_tokeep_night = otu_table_merged_night_relative_abund[(otu_table_merged_night_relative_abund > 0.001).sum(axis=1) >= (otu_table_merged_night_relative_abund.shape[1] * 0.1)].index

Importing the OTU CPM matrices and filter OTUs:

In [8]:
otu_table_merged_day_cpm = pd.read_csv('/home/rsantos/Repositories/maize_microbiome_transcriptomics/16S_wallace2018/combine_day_night_samples/summed_d_n_otu_day_cpm.tsv',
                            sep='\t')
otu_table_merged_day_cpm.set_index('OTU ID', inplace=True)

otu_table_merged_night_cpm = pd.read_csv('/home/rsantos/Repositories/maize_microbiome_transcriptomics/16S_wallace2018/combine_day_night_samples/summed_d_n_otu_night_cpm.tsv',
                            sep='\t')
otu_table_merged_night_cpm.set_index('OTU ID', inplace=True)


In [9]:
otu_table_merged_day_cpm_filtered = otu_table_merged_day_cpm[otu_table_merged_day_cpm.index.isin(otus_tokeep_day)]
otu_table_merged_night_cpm_filtered = otu_table_merged_night_cpm[otu_table_merged_night_cpm.index.isin(otus_tokeep_night)]

## Filtering low variance (expression or OTUs)


[Priya et al 2022](https://www.nature.com/articles/s41564-022-01121-z) used 25% quantile as cutoff for gene expression analysis.

In [10]:
import numpy as np

# Calculate the coefficient of variation for each row
kremling_expression_v5_day_tpm_filtered_cv = kremling_expression_v5_day_tpm_filtered.apply(lambda row: np.std(row) / np.mean(row), axis=1)
kremling_expression_v5_night_tpm_filtered_cv = kremling_expression_v5_night_tpm_filtered.apply(lambda row: np.std(row) / np.mean(row), axis=1)

Filtering the Gene TPM matrix:

In [11]:
kremling_expression_v5_night_tpm_filtered_cv_filtered = kremling_expression_v5_night_tpm_filtered.loc[kremling_expression_v5_night_tpm_filtered_cv[kremling_expression_v5_night_tpm_filtered_cv > kremling_expression_v5_night_tpm_filtered_cv.quantile(q=0.25)].index]
kremling_expression_v5_day_tpm_filtered_cv_filtered = kremling_expression_v5_day_tpm_filtered.loc[kremling_expression_v5_day_tpm_filtered_cv[kremling_expression_v5_day_tpm_filtered_cv > kremling_expression_v5_day_tpm_filtered_cv.quantile(q=0.25)].index]
print(kremling_expression_v5_night_tpm_filtered_cv_filtered.shape)
print(kremling_expression_v5_day_tpm_filtered_cv_filtered.shape)

(10222, 228)
(9830, 176)


In [31]:
# Calculate the coefficient of variation for each row
kremling_expression_v5_day_cpm_filtered_cv = kremling_expression_v5_day_cpm_filtered.apply(lambda row: np.std(row) / np.mean(row), axis=1)
kremling_expression_v5_night_cpm_filtered_cv = kremling_expression_v5_night_cpm_filtered.apply(lambda row: np.std(row) / np.mean(row), axis=1)

Filtering the Gene CPM matrix:

In [32]:
kremling_expression_v5_night_cpm_filtered_cv_filtered = kremling_expression_v5_night_cpm_filtered.loc[kremling_expression_v5_night_cpm_filtered_cv[kremling_expression_v5_night_cpm_filtered_cv > kremling_expression_v5_night_cpm_filtered_cv.quantile(q=0.25)].index]
kremling_expression_v5_day_cpm_filtered_cv_filtered = kremling_expression_v5_day_cpm_filtered.loc[kremling_expression_v5_day_cpm_filtered_cv[kremling_expression_v5_day_cpm_filtered_cv > kremling_expression_v5_day_cpm_filtered_cv.quantile(q=0.25)].index]
print(kremling_expression_v5_night_cpm_filtered_cv_filtered.shape)
print(kremling_expression_v5_day_cpm_filtered_cv_filtered.shape)

(10222, 228)
(9830, 176)


Filtering the OTU CPM matrix:

In [12]:
# Calculate the coefficient of variation for each row
otu_table_merged_day_cpm_filtered_cv = otu_table_merged_day_cpm_filtered.apply(lambda row: np.std(row) / np.mean(row), axis=1)
otu_table_merged_night_cpm_filtered_cv = otu_table_merged_night_cpm_filtered.apply(lambda row: np.std(row) / np.mean(row), axis=1)

In [13]:
otu_table_merged_day_cpm_filtered_cv_filtered = otu_table_merged_day_cpm_filtered.loc[otu_table_merged_day_cpm_filtered_cv[otu_table_merged_day_cpm_filtered_cv > otu_table_merged_day_cpm_filtered_cv.quantile(q=0.25)].index]
otu_table_merged_night_cpm_filtered_cv_filtered = otu_table_merged_night_cpm_filtered.loc[otu_table_merged_night_cpm_filtered_cv[otu_table_merged_night_cpm_filtered_cv > otu_table_merged_day_cpm_filtered_cv.quantile(q=0.25)].index]

## Correlations

In [14]:
from corals.threads import set_threads_for_external_libraries
set_threads_for_external_libraries(n_threads=1)
import numpy as np
from corals.correlation.full.default import cor_full



### OTU (CPM) - Gene (CPM)

In [35]:
concat_df_night = pd.concat([kremling_expression_v5_night_cpm_filtered_cv_filtered, otu_table_merged_night_cpm_filtered_cv_filtered], axis=0)
concat_df_day = pd.concat([kremling_expression_v5_day_cpm_filtered_cv_filtered, otu_table_merged_day_cpm_filtered_cv_filtered], axis=0)

concatenated_transposed_day = concat_df_day.transpose()
concatenated_transposed_night = concat_df_night.transpose()

cor_values_day = cor_full(concatenated_transposed_day)
cor_values_night = cor_full(concatenated_transposed_night)

true_positions_day = np.where(cor_values_day > 0.6)
shape_row_day = kremling_expression_v5_day_cpm_filtered_cv_filtered.shape[0]
true_positions_night = np.where(cor_values_night > 0.6)
shape_row_night = kremling_expression_v5_night_cpm_filtered_cv_filtered.shape[0]

pairs_day_genecpm_otucpm = []
pairs_night_genecpm_otucpm = []

for i in range(len(true_positions_day[0])):
    if (true_positions_day[1][i] > (shape_row_day - 1)) and (true_positions_day[0][i] < shape_row_day):
        pairs_day_genecpm_otucpm.append((str(cor_values_day.columns[true_positions_day[1][i]]),
              str(cor_values_day.index[true_positions_day[0][i]])))

for i in range(len(true_positions_night[0])):
    if (true_positions_night[1][i] > (shape_row_night - 1)) and (true_positions_night[0][i] < shape_row_night):
        pairs_night_genecpm_otucpm.append((str(cor_values_night.columns[true_positions_night[1][i]]),
              str(cor_values_night.index[true_positions_night[0][i]])))

### OTU (CPM) - Gene (TPM)

In [36]:
concat_df_night = pd.concat([kremling_expression_v5_night_tpm_filtered_cv_filtered, otu_table_merged_night_cpm_filtered_cv_filtered], axis=0)
concat_df_day = pd.concat([kremling_expression_v5_day_tpm_filtered_cv_filtered, otu_table_merged_day_cpm_filtered_cv_filtered], axis=0)

concatenated_transposed_day = concat_df_day.transpose()
concatenated_transposed_night = concat_df_night.transpose()

cor_values_day = cor_full(concatenated_transposed_day)
cor_values_night = cor_full(concatenated_transposed_night)

true_positions_day = np.where(cor_values_day > 0.6)
shape_row_day = kremling_expression_v5_day_tpm_filtered_cv_filtered.shape[0]
true_positions_night = np.where(cor_values_night > 0.6)
shape_row_night = kremling_expression_v5_night_tpm_filtered_cv_filtered.shape[0]

pairs_day_genetpm_otucpm = []
pairs_night_genetpm_otucpm = []

for i in range(len(true_positions_day[0])):
    if (true_positions_day[1][i] > (shape_row_day - 1)) and (true_positions_day[0][i] < shape_row_day):
        pairs_day_genetpm_otucpm.append((str(cor_values_day.columns[true_positions_day[1][i]]),
              str(cor_values_day.index[true_positions_day[0][i]])))

for i in range(len(true_positions_night[0])):
    if (true_positions_night[1][i] > (shape_row_night - 1)) and (true_positions_night[0][i] < shape_row_night):
        pairs_night_genetpm_otucpm.append((str(cor_values_night.columns[true_positions_night[1][i]]),
              str(cor_values_night.index[true_positions_night[0][i]])))


In [37]:
print(len(pairs_day_genecpm_otucpm))
print(len(pairs_night_genecpm_otucpm))
print(len(pairs_day_genetpm_otucpm))
print(len(pairs_night_genetpm_otucpm))

599
110
552
112


In [41]:
print(len(set(pairs_day_genecpm_otucpm).intersection(pairs_day_genetpm_otucpm)))
print(len(set(pairs_night_genecpm_otucpm).intersection(pairs_night_genetpm_otucpm)))

544
104


### OTU (CPM) - Gene (TMM)

### OTU (CPM) - Gene (RPKM)

### OTU (CPM) - Gene (UQ)

### OTU (CPM) - Gene (CTF)

### OTU (CPM) - Gene (CUF)

### OTU (CPM) - Gene (CPM) - ASINH

### OTU (CPM) - Gene (TPM) - ASINH

### OTU (CPM) - Gene (TMM) - ASINH

### OTU (CPM) - Gene (RPKM) - ASINH

### OTU (CPM) - Gene (UQ) - ASINH

### OTU (CPM) - Gene (CTF) - ASINH

### OTU (CPM) - Gene (CUF) - ASINH