
  TCGA Preliminary Analysis:


The Cancer Genome Atlas (TCGA) contains molecular profiles for 33 different cancer types, based on over 20,000 samples. The specific dataset I’m working with has data from 4 groups: breast cancer and liver cancer and, for each, tumour and non-tumour samples. In each of these four categories, we have 3 patient samples. As such, this dataset should be useful in answering the question: “By gene expression, is each of these cancer types more similar to the other, or to its respective non-tumour tissue?” The answer will, of course, vary by the genes in question, which is why a plot would be useful. Both answers have a use:
1. Identifying genes specific to each cancer type is crucial for distinguishing between different cancers. It would, for example, be very useful if a blood sample could be used to screen for cancers in different tissues.
2. On the other hand, identifying genes commonly modified in all or most cancer types has the potential to provide the most powerful, most widely applicable screening tools.

The issue is that the underlying tissues express different genes at different baseline levels. Hence, it is obvious that some genes will be expressed differently in cancers arising from different tissue types. There is also substantial existing documentation of genes that are differentially expressed in different cancer types.

So, a more interesting question to ask: taking only those genes that have the same baseline expression in different tissues (e.g. housekeeping genes), how does their expression differ between cancer types? First, we must select the genes with similar expression in the non-tumour samples of both breast and liver tissue, thus establishing a consistent baseline. Next, we select for a large log-fold change over baseline expression, in order to weed out genes that are not differentially expressed in either cancer. These will likely be the majority of genes that pass the first screen (due to zero-inflation). Finally, for the genes that pass both screens: are they expressed strongly in either or both types of cancer?
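The log-fold-change screen mentioned above is not computed later in this notebook, so here is a minimal sketch of what it could look like. The helper name `log2_fold_change` and the pseudocount of 1 are assumptions made for illustration (the pseudocount keeps zero-count genes finite); the values are the LIHC samples for two genes from the expression table loaded further down.

```python
import numpy as np
import pandas as pd

def log2_fold_change(tumour, normal, pseudocount=1.0):
    """Row-wise log2 ratio of mean tumour to mean non-tumour expression.

    The pseudocount is an assumption: it keeps genes with zero counts finite.
    """
    return np.log2((tumour.mean(axis=1) + pseudocount) /
                   (normal.mean(axis=1) + pseudocount))

genes = ["a100134869", "a10357"]
# LIHC tumour (TP) and non-tumour (NT) values copied from the expression table.
tumour = pd.DataFrame({"TP1": [5.69, 138.30], "TP2": [5.41, 144.07],
                       "TP3": [6.14, 73.93]}, index=genes)
normal = pd.DataFrame({"NT1": [0.00, 103.92], "NT2": [2.59, 96.89],
                       "NT3": [1.06, 97.03]}, index=genes)

lfc = log2_fold_change(tumour, normal)
print(lfc.round(2))
```

A gene would pass this screen only if the absolute value of its log-fold change exceeded some chosen cutoff; the cutoff itself is a judgment call and is not specified here.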

Method:
1. Starting with the non-tumour samples, conduct two-sample t-tests of breast vs liver tissue for all genes. Generate a p-value for each gene, and store it in an additional column of the dataframe.
2. Conduct a second round of t-tests, this time between all 6 of the tumour samples and all 6 of the non-tumour samples. Save (1-p) as a column. This is just to weed out genes whose expression is not affected by either cancer, and so would be more likely to have the same baseline expression.
3. Finally, conduct the last set of t-tests, between the tumour samples of the two tissues, for each gene.

Display the whole thing via a nice 3-D interactive plot. Pick a few genes from either end of the z-axis to spotlight.
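Each step of the method is the same operation, a per-gene t-test between two sets of sample columns, so it can be wrapped in one small helper. The sketch below is only an illustration on made-up numbers: `row_pvalues`, the toy matrix and the short group names (`L_TP`, `L_NT`, `B_TP`, `B_NT`) are all hypothetical and are not part of the notebook.

```python
import pandas as pd
from scipy import stats

def row_pvalues(df, cols_a, cols_b, name):
    """Two-sample t-test per row (gene) between two sets of sample columns."""
    result = stats.ttest_ind(df[cols_a].to_numpy(), df[cols_b].to_numpy(),
                             equal_var=True, axis=1)
    return pd.Series(result.pvalue, index=df.index, name=name)

group_names = ("L_TP", "L_NT", "B_TP", "B_NT")
# Toy expression matrix: 3 genes x 4 groups of 3 samples (values are made up).
toy = pd.DataFrame(
    [[10, 12, 11, 10, 11, 12, 30, 33, 31, 10, 12, 11],
     [50, 55, 52, 20, 22, 21, 51, 54, 53, 49, 52, 50],
     [5, 6, 7, 5, 7, 6, 6, 5, 7, 7, 5, 6]],
    index=["geneA", "geneB", "geneC"],
    columns=[f"{g}_{i}" for g in group_names for i in range(3)],
    dtype=float,
)
groups = {g: [c for c in toy.columns if c.startswith(g)] for g in group_names}

pvals = pd.concat(
    [row_pvalues(toy, groups["L_NT"], groups["B_NT"], "baseline"),       # step 1
     row_pvalues(toy, groups["L_TP"], groups["L_NT"], "change_liver"),   # step 2
     row_pvalues(toy, groups["B_TP"], groups["B_NT"], "change_breast"),  # step 2
     row_pvalues(toy, groups["L_TP"], groups["B_TP"], "tumour_vs_tumour")],  # step 3
    axis=1,
)
print(pvals)
```

Returning a named `Series` that already carries the gene index means the results can be concatenated directly, without the separate conversion to a dataframe and index assignment.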

In [None]:
!pip install researchpy --quiet

In [None]:
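# Data handling
import pandas as pd
import numpy as np
# Statistics (the t-tests below are called as scipy.stats.ttest_ind)
import scipy
import scipy.stats as stats
import researchpy as rp
import plotly as pl

# Silence library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')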

In [None]:
GeneExpression = pd.read_table('https://raw.githubusercontent.com/PineBiotech/omicslogic/master/LIHC_BRCA_data1_marked.txt')
GeneExpression 

Unnamed: 0,Id,TCGA-2V-A95S-01A-11R-A37K-07_LIHC_TP,TCGA-2Y-A9GS-01A-12R-A38B-07_LIHC_TP,TCGA-2Y-A9GU-01A-11R-A38B-07_LIHC_TP,TCGA-BC-A10Q-11A-11R-A131-07_LIHC_NT,TCGA-BC-A10T-11A-11R-A131-07_LIHC_NT,TCGA-BC-A10W-11A-11R-A131-07_LIHC_NT,TCGA-3C-AAAU-01A-11R-A41B-07_BRCA_TP,TCGA-3C-AALJ-01A-31R-A41B-07_BRCA_TP,TCGA-3C-AALK-01A-11R-A41B-07_BRCA_TP,TCGA-A7-A0CE-11A-21R-A089-07_BRCA_NT,TCGA-A7-A0CH-11A-32R-A089-07_BRCA_NT,TCGA-A7-A0D9-11A-53R-A089-07_BRCA_NT
0,class,LIHC_TP,LIHC_TP,LIHC_TP,LIHC_NT,LIHC_NT,LIHC_NT,BRCA_TP,BRCA_TP,BRCA_TP,BRCA_NT,BRCA_NT,BRCA_NT
1,a100130426,0,0,0,0,0,0,0,0.9066,0,0,0,0
2,a100133144,2.31,53.59,6.86,2,1.41,4.94,16.3644,11.6228,12.0894,4.3333,4.2087,3.055
3,a100134869,5.69,5.41,6.14,0,2.59,1.06,12.9316,9.2294,11.0799,3.9206,2.1852,0
4,a10357,138.3,144.07,73.93,103.92,96.89,97.03,52.1503,154.2974,143.8643,78.9238,53.638,87.5764
...,...,...,...,...,...,...,...,...,...,...,...,...,...
20527,ZYX|7791,4869,10756,3708,2799,1661,4915,3507.2482,5458.7489,5691.3529,6455.873,6038.9281,2344.7047
20528,ZZEF1|23140,1366,1533,1606,493,320,638,1894.9342,942.883,781.1336,1314.2857,1477.386,1997.9633
20529,ZZZ3|26009,783,1746,412,486,694,482,1180.4565,509.5195,700.8688,968.254,620.9685,730.6517
20530,psiTPTE22|387590,6,13,3,14,2,13,1.7233,35.3581,66.6115,265.3968,466.7607,346.7413


In [None]:
# Format the dataframe so it is easier to work with:
# 1. Set the gene Id as the index.
# 2. Remove the row of class labels (all text), and stash it for later use.
GeneExpression.index = GeneExpression.Id
Types = GeneExpression[0:1]
GeneExpression = GeneExpression[1:]
GeneExpression = GeneExpression.drop(['Id'], axis = 1) 
GeneExpressionNum = GeneExpression.apply(pd.to_numeric)                          #Could also have used .astype(float)
# Types
GeneExpressionNum

Unnamed: 0_level_0,TCGA-2V-A95S-01A-11R-A37K-07_LIHC_TP,TCGA-2Y-A9GS-01A-12R-A38B-07_LIHC_TP,TCGA-2Y-A9GU-01A-11R-A38B-07_LIHC_TP,TCGA-BC-A10Q-11A-11R-A131-07_LIHC_NT,TCGA-BC-A10T-11A-11R-A131-07_LIHC_NT,TCGA-BC-A10W-11A-11R-A131-07_LIHC_NT,TCGA-3C-AAAU-01A-11R-A41B-07_BRCA_TP,TCGA-3C-AALJ-01A-31R-A41B-07_BRCA_TP,TCGA-3C-AALK-01A-11R-A41B-07_BRCA_TP,TCGA-A7-A0CE-11A-21R-A089-07_BRCA_NT,TCGA-A7-A0CH-11A-32R-A089-07_BRCA_NT,TCGA-A7-A0D9-11A-53R-A089-07_BRCA_NT
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
a100130426,0.00,0.00,0.00,0.00,0.00,0.00,0.0000,0.9066,0.0000,0.0000,0.0000,0.0000
a100133144,2.31,53.59,6.86,2.00,1.41,4.94,16.3644,11.6228,12.0894,4.3333,4.2087,3.0550
a100134869,5.69,5.41,6.14,0.00,2.59,1.06,12.9316,9.2294,11.0799,3.9206,2.1852,0.0000
a10357,138.30,144.07,73.93,103.92,96.89,97.03,52.1503,154.2974,143.8643,78.9238,53.6380,87.5764
a10431,1561.00,1297.00,1423.00,1454.00,1125.00,2128.00,408.0760,1360.8341,865.5358,978.4127,970.7569,770.3666
...,...,...,...,...,...,...,...,...,...,...,...,...
ZYX|7791,4869.00,10756.00,3708.00,2799.00,1661.00,4915.00,3507.2482,5458.7489,5691.3529,6455.8730,6038.9281,2344.7047
ZZEF1|23140,1366.00,1533.00,1606.00,493.00,320.00,638.00,1894.9342,942.8830,781.1336,1314.2857,1477.3860,1997.9633
ZZZ3|26009,783.00,1746.00,412.00,486.00,694.00,482.00,1180.4565,509.5195,700.8688,968.2540,620.9685,730.6517
psiTPTE22|387590,6.00,13.00,3.00,14.00,2.00,13.00,1.7233,35.3581,66.6115,265.3968,466.7607,346.7413
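The cells that follow pick out each sample group by column position (for example `iloc[:, 3:6]`). Since every column name ends in its class label, the groups could also be selected by name, which would keep working if the column order changed. This is a minimal sketch of that alternative; `columns_for` is a hypothetical helper, not something the notebook defines.

```python
# Sample column names copied from the expression table above.
columns = [
    "TCGA-2V-A95S-01A-11R-A37K-07_LIHC_TP",
    "TCGA-2Y-A9GS-01A-12R-A38B-07_LIHC_TP",
    "TCGA-2Y-A9GU-01A-11R-A38B-07_LIHC_TP",
    "TCGA-BC-A10Q-11A-11R-A131-07_LIHC_NT",
    "TCGA-BC-A10T-11A-11R-A131-07_LIHC_NT",
    "TCGA-BC-A10W-11A-11R-A131-07_LIHC_NT",
    "TCGA-3C-AAAU-01A-11R-A41B-07_BRCA_TP",
    "TCGA-3C-AALJ-01A-31R-A41B-07_BRCA_TP",
    "TCGA-3C-AALK-01A-11R-A41B-07_BRCA_TP",
    "TCGA-A7-A0CE-11A-21R-A089-07_BRCA_NT",
    "TCGA-A7-A0CH-11A-32R-A089-07_BRCA_NT",
    "TCGA-A7-A0D9-11A-53R-A089-07_BRCA_NT",
]

def columns_for(group, names):
    """Return the sample columns whose name ends with the given class label."""
    return [name for name in names if name.endswith("_" + group)]

groups = {g: columns_for(g, columns)
          for g in ("LIHC_TP", "LIHC_NT", "BRCA_TP", "BRCA_NT")}
for label, members in groups.items():
    print(label, len(members))
```

With this in place, `GeneExpressionNum[groups["LIHC_NT"]]` would select the same columns as `GeneExpressionNum.iloc[:, 3:6]`.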


1. Generate the p-values for comparing non-tumour samples. A lower p-value here is stronger evidence that the gene's baseline expression differs between the two tissues being studied. Selecting for a high p-value therefore picks out genes for which the test found no evidence of a difference in baseline expression; with only 3 samples per group, this is weak evidence of equality rather than proof of it. These are the genes which are of interest to us.

In [None]:
# Row-wise two-sample t-test: LIHC non-tumour (columns 3-5) vs BRCA non-tumour (columns 9-11).
NT_BaselineP_Value = scipy.stats.ttest_ind(GeneExpressionNum.iloc[:, 3:6], GeneExpressionNum.iloc[:, 9:12], equal_var = True, axis = 1).pvalue
# ttest_ind returns a NumPy array, so convert it to a pandas dataframe indexed by gene.
NT_BaselineP_Value = pd.DataFrame(NT_BaselineP_Value, columns = ['NT_BaselineP_Value'])
NT_BaselineP_Value.index = GeneExpressionNum.index
# Show the non-tumour samples alongside their p-value.
NT_Baseline = pd.concat([GeneExpressionNum.iloc[:, 3:6], GeneExpressionNum.iloc[:, 9:12], NT_BaselineP_Value], join='outer', axis =1)
NT_Baseline

Unnamed: 0_level_0,TCGA-BC-A10Q-11A-11R-A131-07_LIHC_NT,TCGA-BC-A10T-11A-11R-A131-07_LIHC_NT,TCGA-BC-A10W-11A-11R-A131-07_LIHC_NT,TCGA-A7-A0CE-11A-21R-A089-07_BRCA_NT,TCGA-A7-A0CH-11A-32R-A089-07_BRCA_NT,TCGA-A7-A0D9-11A-53R-A089-07_BRCA_NT,NT_BaselineP_Value
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
a100130426,0.00,0.00,0.00,0.0000,0.0000,0.0000,
a100133144,2.00,1.41,4.94,4.3333,4.2087,3.0550,0.405483
a100134869,0.00,2.59,1.06,3.9206,2.1852,0.0000,0.579891
a10357,103.92,96.89,97.03,78.9238,53.6380,87.5764,0.068195
a10431,1454.00,1125.00,2128.00,978.4127,970.7569,770.3666,0.094029
...,...,...,...,...,...,...,...
ZYX|7791,2799.00,1661.00,4915.00,6455.8730,6038.9281,2344.7047,0.323060
ZZEF1|23140,493.00,320.00,638.00,1314.2857,1477.3860,1997.9633,0.007871
ZZZ3|26009,486.00,694.00,482.00,968.2540,620.9685,730.6517,0.152018
psiTPTE22|387590,14.00,2.00,13.00,265.3968,466.7607,346.7413,0.003952
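The empty p-value for a100130426 in the output above is a NaN: every non-tumour value for that gene is 0, so both groups have zero variance and the t statistic is undefined. The short check below reproduces this on the values shown in the table, next to ZZEF1|23140 for contrast. It is only a sketch on two hand-copied rows, not a rerun of the full analysis.

```python
import numpy as np
from scipy import stats

# a100130426: every non-tumour value is 0, so both groups have zero variance.
flat = stats.ttest_ind([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], equal_var=True)

# ZZEF1|23140: LIHC and BRCA non-tumour values copied from the table above.
zzef1 = stats.ttest_ind([493.00, 320.00, 638.00],
                        [1314.2857, 1477.3860, 1997.9633], equal_var=True)

print("all-zero gene p-value is NaN:", bool(np.isnan(flat.pvalue)))
print("ZZEF1 p-value:", round(float(zzef1.pvalue), 4))
```

Rows like the first one cannot be plotted, which is why they have to be dropped before the 3-D plot is drawn.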


2. Generate p-values to screen for significant changes in expression. We are now selecting for a small p-value, indicating a low probability of seeing a difference this large if the non-tumour and tumour samples had the same underlying distribution. Given that there are two sets of samples to compare, the only responsible way to plot this is to generate separate p-values for each tissue. (E.g. combining all non-tumour samples and all tumour samples for an n = 6 t-test would result in very high within-group variance, since it mixes the two tissues, and would ultimately be useless.)

In [None]:
# Row-wise two-sample t-test: LIHC tumour (columns 0-2) vs LIHC non-tumour (columns 3-5).
ChangeinExpression_P_Value_L = scipy.stats.ttest_ind(GeneExpressionNum.iloc[:, 0:3], GeneExpressionNum.iloc[:, 3:6], equal_var = True, axis = 1).pvalue
# Convert the NumPy array to a pandas dataframe indexed by gene.
ChangeinExpression_P_Value_L = pd.DataFrame(ChangeinExpression_P_Value_L, columns = ['ChangeinExpression_P_Value_LIHC'])
ChangeinExpression_P_Value_L.index = GeneExpressionNum.index
# Show the LIHC samples alongside their p-value.
ChangeinExpression_L = pd.concat([GeneExpressionNum.iloc[:, 0:3], GeneExpressionNum.iloc[:, 3:6], ChangeinExpression_P_Value_L], join='outer', axis =1)
ChangeinExpression_L

Unnamed: 0_level_0,TCGA-2V-A95S-01A-11R-A37K-07_LIHC_TP,TCGA-2Y-A9GS-01A-12R-A38B-07_LIHC_TP,TCGA-2Y-A9GU-01A-11R-A38B-07_LIHC_TP,TCGA-BC-A10Q-11A-11R-A131-07_LIHC_NT,TCGA-BC-A10T-11A-11R-A131-07_LIHC_NT,TCGA-BC-A10W-11A-11R-A131-07_LIHC_NT,ChangeinExpression_P_Value_LIHC
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
a100130426,0.00,0.00,0.00,0.00,0.00,0.00,
a100133144,2.31,53.59,6.86,2.00,1.41,4.94,0.331431
a100134869,5.69,5.41,6.14,0.00,2.59,1.06,0.004399
a10357,138.30,144.07,73.93,103.92,96.89,97.03,0.437173
a10431,1561.00,1297.00,1423.00,1454.00,1125.00,2128.00,0.665620
...,...,...,...,...,...,...,...
ZYX|7791,4869.00,10756.00,3708.00,2799.00,1661.00,4915.00,0.235732
ZZEF1|23140,1366.00,1533.00,1606.00,493.00,320.00,638.00,0.000935
ZZZ3|26009,783.00,1746.00,412.00,486.00,694.00,482.00,0.350447
psiTPTE22|387590,6.00,13.00,3.00,14.00,2.00,13.00,0.655799


In [None]:
# Row-wise two-sample t-test: BRCA tumour (columns 6-8) vs BRCA non-tumour (columns 9-11).
ChangeinExpression_P_Value_B = scipy.stats.ttest_ind(GeneExpressionNum.iloc[:, 6:9], GeneExpressionNum.iloc[:, 9:12], equal_var = True, axis = 1).pvalue
# Convert the NumPy array to a pandas dataframe indexed by gene.
ChangeinExpression_P_Value_B = pd.DataFrame(ChangeinExpression_P_Value_B, columns = ['ChangeinExpression_P_Value_BRCA'])
ChangeinExpression_P_Value_B.index = GeneExpressionNum.index
# Show the BRCA samples alongside their p-value.
ChangeinExpression_B = pd.concat([GeneExpressionNum.iloc[:, 6:9], GeneExpressionNum.iloc[:, 9:12], ChangeinExpression_P_Value_B], join='outer', axis =1)
ChangeinExpression_B

Unnamed: 0_level_0,TCGA-3C-AAAU-01A-11R-A41B-07_BRCA_TP,TCGA-3C-AALJ-01A-31R-A41B-07_BRCA_TP,TCGA-3C-AALK-01A-11R-A41B-07_BRCA_TP,TCGA-A7-A0CE-11A-21R-A089-07_BRCA_NT,TCGA-A7-A0CH-11A-32R-A089-07_BRCA_NT,TCGA-A7-A0D9-11A-53R-A089-07_BRCA_NT,ChangeinExpression_P_Value_BRCA
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
a100130426,0.0000,0.9066,0.0000,0.0000,0.0000,0.0000,0.373901
a100133144,16.3644,11.6228,12.0894,4.3333,4.2087,3.0550,0.003710
a100134869,12.9316,9.2294,11.0799,3.9206,2.1852,0.0000,0.004384
a10357,52.1503,154.2974,143.8643,78.9238,53.6380,87.5764,0.271066
a10431,408.0760,1360.8341,865.5358,978.4127,970.7569,770.3666,0.925098
...,...,...,...,...,...,...,...
ZYX|7791,3507.2482,5458.7489,5691.3529,6455.8730,6038.9281,2344.7047,0.969213
ZZEF1|23140,1894.9342,942.8830,781.1336,1314.2857,1477.3860,1997.9633,0.388805
ZZZ3|26009,1180.4565,509.5195,700.8688,968.2540,620.9685,730.6517,0.921093
psiTPTE22|387590,1.7233,35.3581,66.6115,265.3968,466.7607,346.7413,0.006115
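The argument in step 2 for keeping the two tissues separate can be checked on a single gene. The sketch below takes the psiTPTE22|387590 values from the tables above and compares the per-tissue p-values with the p-value from one pooled n = 6 test. It covers one gene only, so it illustrates the effect rather than demonstrating it in general.

```python
from scipy import stats

# psiTPTE22|387590 values copied from the expression tables above.
lihc_tp, lihc_nt = [6.0, 13.0, 3.0], [14.0, 2.0, 13.0]
brca_tp = [1.7233, 35.3581, 66.6115]
brca_nt = [265.3968, 466.7607, 346.7413]

# Separate tests, one per tissue (what the notebook does).
p_lihc = stats.ttest_ind(lihc_tp, lihc_nt, equal_var=True).pvalue
p_brca = stats.ttest_ind(brca_tp, brca_nt, equal_var=True).pvalue

# Pooled test: all 6 tumour samples against all 6 non-tumour samples.
p_pooled = stats.ttest_ind(lihc_tp + brca_tp, lihc_nt + brca_nt,
                           equal_var=True).pvalue

print(f"LIHC {p_lihc:.4f}  BRCA {p_brca:.4f}  pooled {p_pooled:.4f}")
```

For this gene the clear change in breast tissue is diluted once the liver samples, which sit at a very different expression level, are mixed into the same groups.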


3. Finally, compare expression between the two tumour types.

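In the planning of this analysis I had an interesting idea, which was to add a fourth dimension through the use of colour. So, conduct one final t-test, between the tumour samples of the two tissues. The result can be displayed categorically, based on the p-value generated: 1. A "high-correlation" group, with p-value < 0.1. 2. A "low-correlation" group, with p-value > 0.9. Note that a t-test compares group means rather than measuring correlation: a small p-value here indicates that mean expression differs between the two tumour types, and a large p-value that no difference was detected.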

In [None]:
# Row-wise two-sample t-test: LIHC tumour (columns 0-2) vs BRCA tumour (columns 6-8).
TumourExpression_Comparison_P_Value = scipy.stats.ttest_ind(GeneExpressionNum.iloc[:, 0:3], GeneExpressionNum.iloc[:, 6:9], equal_var = True, axis = 1).pvalue
# Convert the NumPy array to a pandas dataframe indexed by gene.
TumourExpression_Comparison_P_Value = pd.DataFrame(TumourExpression_Comparison_P_Value, columns = ['TumourExpression_Comparison'])
TumourExpression_Comparison_P_Value.index = GeneExpressionNum.index
# Show the tumour samples alongside their p-value.
TumourExpression_Comparison = pd.concat([GeneExpressionNum.iloc[:, 0:3], GeneExpressionNum.iloc[:, 6:9], TumourExpression_Comparison_P_Value], join='outer', axis =1)
TumourExpression_Comparison

Unnamed: 0_level_0,TCGA-2V-A95S-01A-11R-A37K-07_LIHC_TP,TCGA-2Y-A9GS-01A-12R-A38B-07_LIHC_TP,TCGA-2Y-A9GU-01A-11R-A38B-07_LIHC_TP,TCGA-3C-AAAU-01A-11R-A41B-07_BRCA_TP,TCGA-3C-AALJ-01A-31R-A41B-07_BRCA_TP,TCGA-3C-AALK-01A-11R-A41B-07_BRCA_TP,TumourExpression_Comparison
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
a100130426,0.00,0.00,0.00,0.0000,0.9066,0.0000,0.373901
a100133144,2.31,53.59,6.86,16.3644,11.6228,12.0894,0.669774
a100134869,5.69,5.41,6.14,12.9316,9.2294,11.0799,0.008075
a10357,138.30,144.07,73.93,52.1503,154.2974,143.8643,0.962099
a10431,1561.00,1297.00,1423.00,408.0760,1360.8341,865.5358,0.126895
...,...,...,...,...,...,...,...
ZYX|7791,4869.00,10756.00,3708.00,3507.2482,5458.7489,5691.3529,0.533319
ZZEF1|23140,1366.00,1533.00,1606.00,1894.9342,942.8830,781.1336,0.451793
ZZZ3|26009,783.00,1746.00,412.00,1180.4565,509.5195,700.8688,0.701272
psiTPTE22|387590,6.00,13.00,3.00,1.7233,35.3581,66.6115,0.224457


In [None]:
FinalAnalysisData = pd.concat([NT_BaselineP_Value, ChangeinExpression_P_Value_L, ChangeinExpression_P_Value_B, TumourExpression_Comparison_P_Value], join='outer', axis =1)
FinalAnalysisData

Unnamed: 0_level_0,NT_BaselineP_Value,ChangeinExpression_P_Value_LIHC,ChangeinExpression_P_Value_BRCA,TumourExpression_Comparison
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
a100130426,,,0.373901,0.373901
a100133144,0.405483,0.331431,0.003710,0.669774
a100134869,0.579891,0.004399,0.004384,0.008075
a10357,0.068195,0.437173,0.271066,0.962099
a10431,0.094029,0.665620,0.925098,0.126895
...,...,...,...,...
ZYX|7791,0.323060,0.235732,0.969213,0.533319
ZZEF1|23140,0.007871,0.000935,0.388805,0.451793
ZZZ3|26009,0.152018,0.350447,0.921093,0.701272
psiTPTE22|387590,0.003952,0.655799,0.006115,0.224457


In [None]:
ColourCode = []
# Loop by position; .iloc is needed because the index holds gene names, not integers.
for i in range(0,len(FinalAnalysisData.TumourExpression_Comparison)):
  p = FinalAnalysisData.TumourExpression_Comparison.iloc[i]
  if p < 0.1:
    ColourCode.append('HighCorr')
  elif p > 0.9:
    ColourCode.append('LowCorr')
  elif np.isnan(p):    # NaN never compares equal to anything, so "== np.NaN" was always False
    ColourCode.append('NotInteresting_Corr')
  else:
    ColourCode.append('NotInteresting_Corr')

# Genes with a baseline p-value below 0.9 are relabelled, whatever label they received above.
for i in range(0,len(NT_Baseline.NT_BaselineP_Value)): 
  if NT_Baseline.NT_BaselineP_Value.iloc[i] < 0.9:
    ColourCode[i] = 'NotInteresting_Baseline'
ColourCode = pd.DataFrame(ColourCode, columns = ['ColourCode'])
ColourCode.index = GeneExpressionNum.index
ColourCode
FinalAnalysisDataC = pd.concat([FinalAnalysisData, ColourCode], join='outer', axis =1)
FinalAnalysisDataC


Unnamed: 0_level_0,NT_BaselineP_Value,ChangeinExpression_P_Value_LIHC,ChangeinExpression_P_Value_BRCA,TumourExpression_Comparison,ColourCode
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
a100130426,,,0.373901,0.373901,NotInteresting_Corr
a100133144,0.405483,0.331431,0.003710,0.669774,NotInteresting_Baseline
a100134869,0.579891,0.004399,0.004384,0.008075,NotInteresting_Baseline
a10357,0.068195,0.437173,0.271066,0.962099,NotInteresting_Baseline
a10431,0.094029,0.665620,0.925098,0.126895,NotInteresting_Baseline
...,...,...,...,...,...
ZYX|7791,0.323060,0.235732,0.969213,0.533319,NotInteresting_Baseline
ZZEF1|23140,0.007871,0.000935,0.388805,0.451793,NotInteresting_Baseline
ZZZ3|26009,0.152018,0.350447,0.921093,0.701272,NotInteresting_Baseline
psiTPTE22|387590,0.003952,0.655799,0.006115,0.224457,NotInteresting_Baseline
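The same labelling can be written without explicit loops. The sketch below uses `np.select`, which takes the label of the first condition that holds, so listing the baseline check first reproduces the override done by the second loop. The first three rows are copied from the table above; the three `toy_*` rows are made up so that every label appears.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(
    {"NT_BaselineP_Value": [np.nan, 0.579891, 0.068195, 0.95, 0.97, 0.92],
     "TumourExpression_Comparison": [0.373901, 0.008075, 0.962099, 0.05, 0.95, 0.50]},
    index=["a100130426", "a100134869", "a10357", "toy_high", "toy_low", "toy_mid"],
)

baseline = table["NT_BaselineP_Value"].to_numpy()
tumour = table["TumourExpression_Comparison"].to_numpy()

# Conditions are checked in order and the first match wins.
# Comparisons against NaN are False, so NaN rows fall through to the default.
table["ColourCode"] = np.select(
    [baseline < 0.9, tumour < 0.1, tumour > 0.9],
    ["NotInteresting_Baseline", "HighCorr", "LowCorr"],
    default="NotInteresting_Corr",
)
print(table)
```

As in the loop version, a100130426 keeps the `NotInteresting_Corr` label because its missing baseline p-value never satisfies `< 0.9`.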


Finally, remove all the unplottable points, i.e. the rows that contain a NaN p-value.

In [None]:
i = 6
FinalAnalysisDataC.iloc[i]

NT_BaselineP_Value                                0.029838
ChangeinExpression_P_Value_LIHC                   0.003949
ChangeinExpression_P_Value_BRCA                   0.040172
TumourExpression_Comparison                       0.070544
ColourCode                         NotInteresting_Baseline
Name: a155060, dtype: object

In [None]:
FinalAnalysisDataCC = FinalAnalysisDataC.dropna()
FinalAnalysisDataCC

Unnamed: 0_level_0,NT_BaselineP_Value,ChangeinExpression_P_Value_LIHC,ChangeinExpression_P_Value_BRCA,TumourExpression_Comparison,ColourCode
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
a100133144,0.405483,0.331431,0.003710,0.669774,NotInteresting_Baseline
a100134869,0.579891,0.004399,0.004384,0.008075,NotInteresting_Baseline
a10357,0.068195,0.437173,0.271066,0.962099,NotInteresting_Baseline
a10431,0.094029,0.665620,0.925098,0.126895,NotInteresting_Baseline
a155060,0.029838,0.003949,0.040172,0.070544,NotInteresting_Baseline
...,...,...,...,...,...
ZYX|7791,0.323060,0.235732,0.969213,0.533319,NotInteresting_Baseline
ZZEF1|23140,0.007871,0.000935,0.388805,0.451793,NotInteresting_Baseline
ZZZ3|26009,0.152018,0.350447,0.921093,0.701272,NotInteresting_Baseline
psiTPTE22|387590,0.003952,0.655799,0.006115,0.224457,NotInteresting_Baseline


In [None]:
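# Abandoned attempt to drop genes with a baseline p-value below 0.9 one row at a time.
# It raised a ValueError, most likely because .drop() was called without any labels,
# so it is left commented out.
#for i in range(0,len(NT_Baseline.NT_BaselineP_Value)): 
#  if NT_Baseline.NT_BaselineP_Value[i] < 0.9:
#    FinalAnalysisDataCC.iloc[i].drop()

#FinalAnalysisDataCC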
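What the commented-out loop above was trying to do, removing genes whose baseline p-value is below 0.9, is usually done in pandas with a boolean mask rather than row by row. This is a minimal sketch on a toy table: two rows are copied from the results above and the two `toy_gene_*` rows are made up so that something survives the filter.

```python
import pandas as pd

table = pd.DataFrame(
    {"NT_BaselineP_Value": [0.405483, 0.95, 0.003952, 0.91]},
    index=["a100133144", "toy_gene_1", "psiTPTE22|387590", "toy_gene_2"],
)

# Keep only the rows whose baseline p-value is at least 0.9.
kept = table[table["NT_BaselineP_Value"] >= 0.9]
print(kept)
```

Applied to the real data, the same pattern would be `FinalAnalysisDataCC[FinalAnalysisDataCC.NT_BaselineP_Value >= 0.9]`. Whether to filter these genes out or only colour them differently, as the notebook currently does, is a separate choice.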

Plot.

In [None]:
import plotly.express as px

# Columns are passed by name; labels expects a dict, so the stray np.column_stack argument is dropped.
fig = px.scatter_3d(FinalAnalysisDataCC, x='NT_BaselineP_Value', y='ChangeinExpression_P_Value_LIHC', z='ChangeinExpression_P_Value_BRCA', color='ColourCode')
fig.update_layout(height=1000, width=1000, title_text='Simple Plot of Compared P-Values')
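# In a 3-D figure the axes live under layout.scene, so the x range has to be set there.
fig.update_layout(scene=dict(xaxis=dict(range=[0.9, 1])))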
fig.show()

My expectation was to see a clearer separation between the green and purple groups, that is, a fairly tight grouping of each. I find it somewhat counterintuitive that the purple dots, indicating low correlation, are not mostly found along either the x or y axis. I would have expected this, as it would have indicated a large change in expression in one group but not in the other. Instead, the green group, indicating high correlation, is found along these axes. This counterintuitive result may be due to a plotting error, or to the increased variance between patients as a gene becomes highly expressed. It's possible, however, that it holds an insight into the general nature of the data.

For example, toward the far end of both the x and y axes, there are exclusively low-correlation points. That is, the gene's expression changes significantly in both sets, but in a way which is not strongly correlated: there are no genes which seem to function similarly in both cancers. Further analysis of these particular genes may be of use in figuring out whether this has any significance.