# Associations between molecular and environmental changes along the proximal-to-distal axis of the colon
## Abstract
### Objective
Colorectal cancer is a heterogeneous disease, and tumours in the left and right sections of the colon are biologically disparate. The development of the two-colon paradigm, which differentiates colorectal cancers according to their location relative to the splenic flexure, contributed to an improvement in prognosis and treatment. Recent studies have challenged this division by proposing a continuum model in which molecular properties follow a continuous trend along the colon.
### Design
We ask which model better describes colorectal cancer (CRC) by comparing how well each captures the properties of colorectal tumours in a cohort of 522 patients from The Cancer Genome Atlas.
### Results
Neither model outperforms the other. Alterations affecting genes associated with growth are better described by the two-colon paradigm, while the continuum model better approximates alterations affecting genes related to metabolism and environment. As this suggests an environmental impact on changes in selective constraints along the colon, we chart the localised metabolome in a cohort of 27 colon cancer patients. Metabolites follow a continuous trend, in agreement with other tissue-environmental factors such as the microbiome. We show that genes with continuous transcriptional profiles interact with metabolites associated with carcinogenesis, suggesting that gradients of metabolism-mediated selective constraints might contribute to gradual changes in tumours along the colon.
### Conclusion
Our results question the previous left/right model of colon cancer, suggesting that a higher resolution of tumour localisation might better capture the biology of tumours as systems interacting with their local environment, and might hold clinical relevance.
### About this notebook
This notebook reproduces all the results and figures presented in [CITATION].

# LOADING LIBRARIES

In [3]:
%matplotlib inline
import sys
sys.path.append("../../../git/lib") # Path to the profile_analysis_class.py lib
sys.path.append("/home/ieo5417/Documenti/git/network_analysis/lib/") # Path to the network_search_class.py lib
from profile_analysis_class import ProfileAnalysis # Import the profile workflow class
import pandas as pd
import matplotlib.pyplot as plt
from supervenn import supervenn
import mygene
from scipy.stats import chi2_contingency
from joblib import Parallel, delayed
from statsmodels.stats.multitest import fdrcorrection
import matplotlib
import os
import statistics
import numpy as np
import math
from scipy import stats
from scipy.stats import beta, norm

from network_search_class import NetworkSearchAnalysis

# Defining left and right sections
left_sections = ['Descending colon','Sigmoid colon','Rectosigmoid junction','Rectum, NOS']
right_sections = ['Cecum','Ascending colon','Hepatic flexure of colon','Transverse colon']

mg = mygene.MyGeneInfo()
plt.style.use('../../assets/styles/plotting_style.mplstyle') # Path to the matplotlib style sheet
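
The two section lists defined in the cell above encode the two-colon paradigm: each anatomical label is assigned to one side of the splenic flexure. As a minimal sketch of how a sample's side could be looked up from them, the snippet below repeats the two lists and adds a helper. The name `side_of` and the constants `LEFT_SECTIONS` and `RIGHT_SECTIONS` are hypothetical and are not part of the notebook's libraries; returning `None` for labels found in neither list is likewise an assumption about how unlisted sections should be handled.

```python
# Section labels copied from the cell above
LEFT_SECTIONS = ['Descending colon', 'Sigmoid colon', 'Rectosigmoid junction', 'Rectum, NOS']
RIGHT_SECTIONS = ['Cecum', 'Ascending colon', 'Hepatic flexure of colon', 'Transverse colon']


def side_of(section):
    """Hypothetical helper: return 'left', 'right', or None for an unlisted label."""
    if section in LEFT_SECTIONS:
        return 'left'
    if section in RIGHT_SECTIONS:
        return 'right'
    return None  # assumption: labels in neither list are left unclassified


# Example lookups
examples = ['Cecum', 'Rectum, NOS', 'Splenic flexure of colon']
sides = {section: side_of(section) for section in examples}
```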

In [4]:
def dLR(data, left_sec = 'Sigmoid colon', right_sec = 'Transverse colon'):
    # Build a dataframe with the selected left and right sections and their right/left ratio
    # (columns are selected by label; set indexers such as data[{left_sec}] fail in recent pandas)
    ratios = pd.DataFrame(index=data.index)
    ratios['left'] = data[left_sec]
    ratios['right'] = data[right_sec]
    ratios['tcga_r/l'] = ratios['right']/ratios['left']
    return ratios

def calculate_p(name, observable, background):
    p_value = 1-stats.percentileofscore(background,observable)/100
    return (name, p_value)

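def calculate_p_beta(name, observable, background):
    # Upper-tail p-value of the observable under a beta distribution fitted to the background
    a, b, loc, scale = beta.fit(background)
    p_value = beta.sf(observable, a, b, loc, scale) # sf = 1 - cdf, more accurate in the tail
    return (name, p_value)

def calculate_p_norm(name, observable, background):
    # Upper-tail p-value of the observable under a normal distribution fitted to the background
    mu, std = norm.fit(background.astype('float'))
    p_value = norm.sf(observable, mu, std) # sf = 1 - cdf, more accurate in the tail
    return (name, p_value)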

def calculate_q(p_table):
    q_table = pd.DataFrame(index=p_table.index)
    for column in p_table:
        no_na_col = p_table[column].dropna()
        q_values = pd.DataFrame(fdrcorrection(no_na_col)[1], columns=[column], index=no_na_col.index)
        q_table = pd.concat([q_table, q_values], axis=1)
    return q_table
        
def unzip_backgrounds(backgrounds_list):
    backgrounds = {}
    for df in backgrounds_list:
        name = df.columns[0].split('_')[0]
        background = []
        for column in df.columns:
            background = background + df[df[column]>0][column].to_list()
        backgrounds[name] = background
    return backgrounds

def select_significant_models(p_values, observables_scores):
    for key in p_values:
        if p_values[key][1] > 0.05:
            observables_scores.drop(key, axis=1, inplace=True)
    return observables_scores

def assemble_backgrounds(poly_obs_scores, sig_perm_scores, poly_perm_scores):
    backgrounds = {}
    for column in poly_obs_scores.columns:
        if column == 'sigmoidal':
            backgrounds[column] = sig_perm_scores
        else:
            backgrounds[column] = poly_perm_scores[column-1]
    return backgrounds

def get_p_values(poly_obs_scores, backgrounds):
    p_value_tables = pd.DataFrame(index=poly_obs_scores.index, columns=poly_obs_scores.columns)
    for index, row in poly_obs_scores.iterrows():
        for model in row.index:
            score = row[model]
            background = backgrounds[model].loc[index].dropna()
            result = calculate_p_norm(index, score, background)
            p_value_tables.loc[result[0], model] = result[1]
    return p_value_tables

def get_q_values(p_value_tables, poly_obs_scores):
    q_value_tables = pd.DataFrame(index=poly_obs_scores.index, columns=poly_obs_scores.columns)
    for column in p_value_tables:
        q_value_tables[column] = fdrcorrection(p_value_tables[column])[1]
    return q_value_tables

def classify_genes(observables_scores, permutation_scores, q_values):
    columns = []
    for column in observables_scores.columns:
        columns.append(str(column))
    observables_scores.columns = columns
    q_values.columns = columns

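    # Mean permutation score per model, used below to normalise the observed scores
    # (the variable keeps its original name although it stores a mean, not a median)
    median_scores = {}
    for column in observables_scores:
        model_perm_score_table = permutation_scores[column]
        scores = []
        for run in model_perm_score_table:
            scores.extend(model_perm_score_table[run])
        median_scores[column] = statistics.mean(scores)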
    
    models = {}
    for column in q_values:
        models[column] = list(q_values[q_values[column]<=0.2].index)

    continuous = []
    for key in models:
        if key != 'sigmoidal':
            continuous = continuous + models[key]
    continuous = set(continuous)

    classification = {}
    if 'sigmoidal' in models:
        sigmoid = set(models['sigmoidal'])
        common = continuous.intersection(sigmoid)
        sigmoid = [feature for feature in sigmoid if feature not in common]
        continuous = [feature for feature in continuous if feature not in common]
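        # Features significant under both model families; index with a list, since
        # recent pandas versions reject set indexers, and copy before modifying
        common_scores = observables_scores.loc[list(common), q_values.columns].copy()
        for feature in common:
            for key in models:
                if feature not in models[key]:
                    common_scores.loc[feature, key] = np.nan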
        for column in common_scores:
            common_scores[column] = common_scores[column]/median_scores[column]
        max_continuous_score = common_scores.drop('sigmoidal', axis=1).idxmax(axis=1)

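        # Relative difference (%) between the best continuous score and the sigmoidal score
        for index, row in common_scores.iterrows():
            dcont = common_scores.loc[index,max_continuous_score.loc[index]]
            dsig = common_scores.loc[index,'sigmoidal']
            diff = abs(dcont - dsig)
            ratio = diff/max([dcont, dsig])*100
            common_scores.loc[index, r'c\s'] = ratio

        # Features whose two scores differ by less than 5% cannot be assigned to either model
        classification['discarded'] = list(common_scores[common_scores[r'c\s']<5].index)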

        common_scores.drop(classification['discarded'], axis=0, inplace=True)
        best_models = pd.DataFrame(common_scores[q_values.columns].astype('float').idxmax(axis=1), columns=['model'])
        for index, row in best_models.iterrows():
            if row['model'] == 'sigmoidal':
                sigmoid.append(index)
            else:
                continuous.append(index)
        classification['sigmoid'] = sigmoid
    classification['continuous'] = continuous
    return classification

def classify_genes_by_q(observables_scores, permutation_scores, q_values):
    columns = []
    for column in observables_scores.columns:
        columns.append(str(column))
    observables_scores.columns = columns
    q_values.columns = columns

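    # Mean permutation score per model, used below to normalise the observed scores
    # (the variable keeps its original name although it stores a mean, not a median)
    median_scores = {}
    for column in observables_scores:
        model_perm_score_table = permutation_scores[column]
        scores = []
        for run in model_perm_score_table:
            scores.extend(model_perm_score_table[run])
        median_scores[column] = statistics.mean(scores)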
    
    models = {}
    for column in q_values:
        models[column] = list(q_values[q_values[column]<=0.2].index)

    continuous = []
    for key in models:
        if key != 'sigmoidal':
            continuous = continuous + models[key]
    continuous = set(continuous)

    classification = {}
    if 'sigmoidal' in models:
        sigmoid = set(models['sigmoidal'])
        common = continuous.intersection(sigmoid)
        sigmoid = [feature for feature in sigmoid if feature not in common]
        continuous = [feature for feature in continuous if feature not in common]
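        # Features significant under both model families; index with a list, since
        # recent pandas versions reject set indexers, and copy before modifying
        common_scores = observables_scores.loc[list(common), q_values.columns].copy()
        for feature in common:
            for key in models:
                if feature not in models[key]:
                    common_scores.loc[feature, key] = np.nan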
        for column in common_scores:
            common_scores[column] = common_scores[column]/median_scores[column]
        max_continuous_score = common_scores.drop('sigmoidal', axis=1).idxmax(axis=1)

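        # Relative difference (%) between the best continuous score and the sigmoidal score
        for index, row in common_scores.iterrows():
            dcont = common_scores.loc[index,max_continuous_score.loc[index]]
            dsig = common_scores.loc[index,'sigmoidal']
            diff = abs(dcont - dsig)
            ratio = diff/max([dcont, dsig])*100
            common_scores.loc[index, r'c\s'] = ratio

        # Features whose two scores differ by less than 5% cannot be assigned to either model
        classification['discarded'] = list(common_scores[common_scores[r'c\s']<5].index)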

        common_scores.drop(classification['discarded'], axis=0, inplace=True)
        common_qs = common_scores.index
        best_models = pd.DataFrame(q_values.loc[common_qs].astype('float').idxmin(axis=1), columns=['model'])
        for index, row in best_models.iterrows():
            if row['model'] == 'sigmoidal':
                sigmoid.append(index)
            else:
                continuous.append(index)
        classification['sigmoid'] = sigmoid
    classification['continuous'] = continuous
    return classification

def classify_genes_by_q_alternative(observables_scores, permutation_scores, q_values):
    columns = []
    for column in observables_scores.columns:
        columns.append(str(column))
    observables_scores.columns = columns
    q_values.columns = columns

    models = {}
    for column in q_values:
        models[column] = list(q_values[q_values[column]<=0.2].index)

    continuous = []
    for key in models:
        if key != 'sigmoidal':
            continuous = continuous + models[key]
    continuous = set(continuous)

    classification = {}
    if 'sigmoidal' in models:
        sigmoid = set(models['sigmoidal'])
        common = continuous.intersection(sigmoid)
        sigmoid = [feature for feature in sigmoid if feature not in common]
        continuous = [feature for feature in continuous if feature not in common]
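        # q-values of features significant under both model families; index with a list,
        # since recent pandas versions reject set indexers
        common_scores = q_values.loc[list(common), q_values.columns]
        common_scores = common_scores.astype(float)
        # Despite its name, this holds the continuous model with the lowest q-value per feature
        max_continuous_score = common_scores.drop('sigmoidal', axis=1).idxmin(axis=1)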

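        # Relative difference (%) between the best continuous q-value and the sigmoidal q-value
        for index, row in common_scores.iterrows():
            dcont = common_scores.loc[index,max_continuous_score.loc[index]]
            dsig = common_scores.loc[index,'sigmoidal']
            diff = abs(dcont - dsig)
            ratio = diff/max([dcont, dsig])*100
            common_scores.loc[index, r'c\s'] = ratio

        # Features whose two q-values differ by less than 5% cannot be assigned to either model
        classification['discarded'] = list(common_scores[common_scores[r'c\s']<5].index)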
        common_scores.drop(classification['discarded'], axis=0, inplace=True)
        common_qs = common_scores.index
        best_models = pd.DataFrame(q_values.loc[common_qs].astype('float').idxmin(axis=1), columns=['model'])
        for index, row in best_models.iterrows():
            if row['model'] == 'sigmoidal':
                sigmoid.append(index)
            else:
                continuous.append(index)
        classification['sigmoid'] = sigmoid
    classification['continuous'] = continuous
    return classification
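
The helpers above turn a score into a p-value by fitting a distribution to permutation scores (`calculate_p_norm`), then correct the p-values for multiple testing (`get_q_values`, which calls `fdrcorrection`). The following is a minimal, standard-library sketch of those two steps on made-up numbers, not the notebook's own implementation. The names `normal_fit_p_value` and `benjamini_hochberg` are hypothetical. The sketch assumes a maximum-likelihood normal fit (mean and population standard deviation) and the Benjamini-Hochberg procedure, which is the default method of `fdrcorrection`.

```python
from statistics import NormalDist, fmean, pstdev


def normal_fit_p_value(observable, background):
    """Upper-tail p-value of `observable` under a normal fitted to `background`."""
    mu = fmean(background)
    sigma = pstdev(background, mu)  # population SD, i.e. the maximum-likelihood estimate
    return 1.0 - NormalDist(mu, sigma).cdf(observable)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg q-values, returned in the same order as the input."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending p-value
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that q-values stay monotone
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        q_values[i] = running_min
    return q_values


# Toy permutation background and observed scores (illustrative values only)
background = [0.0, 1.0, 2.0, 3.0, 4.0]
p_mid = normal_fit_p_value(2.0, background)   # score at the background mean
p_high = normal_fit_p_value(6.0, background)  # score well above the background

# Toy p-values for four features, corrected and thresholded at q <= 0.2,
# the cut-off used by the classification functions above
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
significant = [q_i <= 0.2 for q_i in q]
```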

# TRANSCRIPTOME PROFILING

In [5]:
# Create workflow class, specifying the path to the SETTINGS.ini file
pa_transcriptome = ProfileAnalysis('../../../docker/analysis/transcriptome')

Project has been created!


