# Week 5 exercise: Analysing data with a large number of parameters

In this exercise, we carry out an unsupervised analysis of gene expression in the TCGA-SKCM cohort. We use linear (principal component analysis, PCA) and non-linear (t-SNE, UMAP) dimensionality reduction methods to represent the expression of thousands of genes per sample in spaces of only a few dimensions, which can then be visualised.

## Loading the libraries

In [3]:
import pandas as pd
import plotly.express as px
import numpy as np
import os
from scipy import stats
#PCA
from sklearn.decomposition import PCA
#TSNE
from sklearn.manifold import TSNE
#UMAP
import umap

# Standardisation (PCA is already imported above)
from sklearn.preprocessing import StandardScaler

## Loading the data

We will work with the gene expression file for the cohort of melanoma patients, which we normalised to TPM two weeks ago (`'TCGA-SKCM.htseq_tpm.csv'`), and we will also need the gene annotation file (`'gencode.v22.annotation.gene.probeMap.with.length.csv'`).

Today we will also use the clinical data file for the patients in the cohort (`'TCGA-SKCM.GDC_phenotype.tsv'`), which we downloaded from this address:
https://xenabrowser.net/datapages/?dataset=TCGA-SKCM.GDC_phenotype.tsv&host=https%3A%2F%2Fgdc.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443


These files are available to you in the UCSC_data folder.

## Question 1: Data preparation

1) Load these files into the tables `smpCounts`, `geneAnnotations` and `clinicalData`. Set the index (row names) to the `Ensembl_ID` column for `smpCounts`, the `ensembl_gene_id` column for `geneAnnotations` and the `submitter_id.samples` column for `clinicalData`. Keep the index column in `geneAnnotations` and `clinicalData` (option `drop=False`).

The file `TCGA-SKCM.htseq_counts.tsv` ends with 5 rows of per-sample statistics, which must be removed.
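A minimal sketch of dropping such trailing rows, on a made-up table. The row labels starting with `__` follow the htseq-count convention and are an assumption here, as are the toy counts; inspect the tail of the real file before applying either approach.

```python
import pandas as pd

# Toy counts table (hypothetical values): 3 genes followed by 5 summary rows
toy_counts = pd.DataFrame(
    {"sample_A": [10, 0, 7, 1, 2, 3, 4, 5],
     "sample_B": [12, 1, 9, 6, 7, 8, 9, 10]},
    index=["ENSG_a", "ENSG_b", "ENSG_c",
           "__no_feature", "__ambiguous", "__too_low_aQual",
           "__not_aligned", "__alignment_not_unique"],
)
toy_counts.index.name = "Ensembl_ID"

# Option 1: drop the last 5 rows by position
genes_by_position = toy_counts.iloc[:-5]

# Option 2: drop rows whose label starts with "__" (safer if the row order changes)
genes_by_label = toy_counts[~toy_counts.index.str.startswith("__")]

print(genes_by_position.index.tolist())
```

Filtering by label is a little more defensive than slicing by position, because it does not rely on the summary rows being last.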

In [4]:
smpCounts = pd.read_csv('../UCSC_data/TCGA-SKCM.htseq_tpm.csv')
smpCounts = smpCounts.set_index('Ensembl_ID')  # Ensembl_ID becomes the index and is dropped from the columns
smpCounts

Ensembl_ID,TCGA-EE-A2GJ-06A,TCGA-EE-A2GI-06A,TCGA-WE-A8ZM-06A,TCGA-DA-A1IA-06A,TCGA-D3-A51H-06A,TCGA-XV-A9VZ-01A,TCGA-FS-A1ZE-06A,TCGA-D3-A51F-06A,TCGA-D3-A8GL-06A,TCGA-BF-A5EP-01A,...,TCGA-EE-A2GP-06A,TCGA-EE-A2M6-06A,TCGA-EE-A3AA-06A,TCGA-FS-A1ZF-06A,TCGA-D9-A6EC-06A,TCGA-FR-A8YC-06A,TCGA-EB-A4XL-01A,TCGA-EB-A551-01A,TCGA-EE-A3J4-06A,TCGA-EE-A3AC-06A
ENSG00000000003.13,5.492020,4.977245,5.686251,4.757404,2.896276,5.301672,3.461976,3.831496,3.257634,4.518998,...,5.462907,5.715717,2.391869,4.806537,5.116857,4.291547,5.300217,3.839361,4.953076,4.828908
ENSG00000000005.5,0.000000,0.080714,1.527938,0.000000,0.231093,0.950625,0.622636,0.050011,0.214703,0.048674,...,0.416431,0.222035,0.072866,0.153887,0.000000,0.246848,0.000000,0.000000,0.000000,0.000000
ENSG00000000419.11,5.523713,6.880899,6.772244,4.593481,5.942068,6.304230,6.840104,6.058202,7.133527,6.205951,...,6.885842,7.098479,7.013252,6.332303,7.159867,5.630500,5.661585,6.422101,5.906909,6.932407
ENSG00000000457.12,3.581697,3.684573,3.381051,2.580018,3.060895,3.472246,3.463646,2.999380,3.856717,2.408740,...,3.003973,2.610218,2.801962,3.428713,2.632950,3.355024,2.838872,2.727272,2.905733,3.779931
ENSG00000000460.15,3.514461,3.646168,2.500569,3.305690,2.613756,2.390197,2.825826,2.642469,3.841492,2.106094,...,3.156954,3.431138,3.542370,3.990699,2.928625,2.021066,2.125280,2.027202,2.524369,3.771591
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ENSGR0000275287.3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
ENSGR0000276543.3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
ENSGR0000277120.3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
ENSGR0000280767.1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


In [5]:
geneAnnotations = pd.read_table('../UCSC_data/genes_info.txt',sep = ' ')
geneAnnotations = geneAnnotations.set_index("ensembl_gene_id",drop=False)
geneAnnotations

ensembl_gene_id,ensembl_gene_id,hgnc_symbol,gene_length
ENSG00000223972.5,ENSG00000223972.5,DDX11L1,1735
ENSG00000227232.5,ENSG00000227232.5,WASH7P,1351
ENSG00000278267.1,ENSG00000278267.1,MIR6859-3,68
ENSG00000243485.3,ENSG00000243485.3,RP11-34P13.3,1021
ENSG00000274890.1,ENSG00000274890.1,MIR1302-9,138
...,...,...,...
ENSG00000198695.2,ENSG00000198695.2,MT-ND6,525
ENSG00000210194.1,ENSG00000210194.1,MT-TE,69
ENSG00000198727.2,ENSG00000198727.2,MT-CYB,1141
ENSG00000210195.2,ENSG00000210195.2,MT-TT,66


In [6]:
clinicalData = pd.read_table('../UCSC_data/TCGA-SKCM.GDC_phenotype.tsv')
clinicalData = clinicalData.set_index("submitter_id.samples",drop=False)
clinicalData

submitter_id.samples,submitter_id.samples,age_at_initial_pathologic_diagnosis,batch_number,bcr,bcr_followup_barcode,bcr_followup_uuid,submitter_id,breslow_depth_value,day_of_dcc_upload,day_of_form_completion,...,days_to_collection.samples,days_to_sample_procurement.samples,initial_weight.samples,is_ffpe.samples,oct_embedded.samples,preservation_method.samples,sample_type.samples,sample_type_id.samples,state.samples,tissue_type.samples
TCGA-D9-A4Z2-01A,TCGA-D9-A4Z2-01A,50.0,262.74.0,Nationwide Children's Hospital,TCGA-D9-A4Z2-F57868,AB5F6B14-DAAE-4D19-B227-EC6E2D5ED0D0,TCGA-D9-A4Z2,25.0,14,28.0,...,66.0,,160.0,False,False,,Primary Tumor,1,released,Not Reported
TCGA-ER-A2NH-06A,TCGA-ER-A2NH-06A,49.0,180.87.0,Nationwide Children's Hospital,TCGA-ER-A2NH-F69370,CF52E6AD-959C-446F-B88F-E7E73B83E405,TCGA-ER-A2NH,4.0,14,7.0,...,744.0,,,False,True,,Metastatic,6,released,Not Reported
TCGA-BF-A5EO-01A,TCGA-BF-A5EO-01A,65.0,291.69.0,Nationwide Children's Hospital,TCGA-BF-A5EO-F68880,9993CEA1-92A4-48E7-904E-542F5079A3C6,TCGA-BF-A5EO,8.0,14,23.0,...,134.0,,14.0,False,True,,Primary Tumor,1,released,Not Reported
TCGA-D9-A6EA-06A,TCGA-D9-A6EA-06A,70.0,316.66.0,Nationwide Children's Hospital,TCGA-D9-A6EA-F57860,C0168BC0-E336-4944-86D9-C332EA962EC0,TCGA-D9-A6EA,6.0,14,28.0,...,478.0,,270.0,False,True,,Metastatic,6,released,Not Reported
TCGA-D9-A4Z3-01A,TCGA-D9-A4Z3-01A,73.0,277.76.0,Nationwide Children's Hospital,TCGA-D9-A4Z3-F58483,54923FD3-981A-4649-BEDD-860A06350154,TCGA-D9-A4Z3,75.0,14,11.0,...,16.0,,170.0,False,False,,Primary Tumor,1,released,Not Reported
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TCGA-D3-A1Q9-06A,TCGA-D3-A1Q9-06A,72.0,180.87.0,Nationwide Children's Hospital,,,TCGA-D3-A1Q9,6.0,14,,...,2065.0,,490.0,False,True,,Metastatic,6,released,Not Reported
TCGA-FS-A1ZP-06A,TCGA-FS-A1ZP-06A,52.0,180.87.0,Nationwide Children's Hospital,,,TCGA-FS-A1ZP,2.5,14,,...,4897.0,,190.0,False,False,,Metastatic,6,released,Not Reported
TCGA-EB-A42Y-01A,TCGA-EB-A42Y-01A,73.0,262.74.0,Nationwide Children's Hospital,TCGA-EB-A42Y-F45845,4802A84B-7396-463C-82B0-24BF85B44EFF,TCGA-EB-A42Y,5.0,14,23.0,...,99.0,,650.0,False,True,,Primary Tumor,1,released,Not Reported
TCGA-WE-A8ZY-06A,TCGA-WE-A8ZY-06A,62.0,388.48.0,Nationwide Children's Hospital,TCGA-WE-A8ZY-F67332,C8CE9B61-BA13-4E16-9639-4464707E3A18,TCGA-WE-A8ZY,3.0,14,4.0,...,1294.0,,510.0,False,True,,Metastatic,6,released,Not Reported


2) As in the third session, keep only the data from female patients for whom both clinical data and gene expression are available.

In [7]:
clinicalData = clinicalData[clinicalData["gender.demographic"]=="female"]
patientID  = smpCounts.columns.intersection(clinicalData["submitter_id.samples"])
clinicalData = clinicalData.loc[patientID]
smpCounts = smpCounts[patientID]
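To see what the intersection step does, here is a small sketch on invented sample IDs (the toy tables and their values are assumptions, not cohort data). `Index.intersection` keeps the labels present on both sides, and subsetting both tables with the same label list leaves them in the same order.

```python
import pandas as pd

# Toy expression table: genes in rows, samples in columns (hypothetical IDs)
expr = pd.DataFrame([[1.0, 2.0, 3.0]], index=["gene1"],
                    columns=["S1", "S2", "S3"])

# Toy clinical table indexed by sample ID; S1 has no clinical record, S4 no expression
clin = pd.DataFrame({"sample_id": ["S3", "S2", "S4"],
                     "sample_type": ["Metastatic", "Primary Tumor", "Metastatic"]}
                    ).set_index("sample_id", drop=False)

# Samples present in both tables
shared = expr.columns.intersection(clin["sample_id"])

# Subset both tables with the same labels so that rows and columns line up
clin = clin.loc[shared]
expr = expr[shared]

print(list(shared))
```

Keeping the two tables in the same sample order matters later, when clinical columns are joined to the PCA, t-SNE and UMAP coordinates.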

We will reduce the dimension of the dataset by working with the genes whose expression varies the most.

5) Select the 500 genes with the largest variance in the normalised dataset.

In [8]:
nGenes = 500
smpCountVarGenes = smpCounts.loc[smpCounts.var(numeric_only = True, axis = 1).sort_values(ascending=False).index[:nGenes]]
smpCountVarGenes

Ensembl_ID,TCGA-DA-A1IA-06A,TCGA-XV-A9VZ-01A,TCGA-BF-A5EP-01A,TCGA-EE-A3AF-06A,TCGA-D9-A3Z3-06A,TCGA-FS-A1ZG-06A,TCGA-EB-A44P-01A,TCGA-FR-A729-06A,TCGA-D3-A5GR-06A,TCGA-FS-A4FC-06A,...,TCGA-D9-A4Z3-01A,TCGA-EB-A41B-01A,TCGA-D3-A3CF-06A,TCGA-FS-A4F0-06A,TCGA-EE-A2GM-06B,TCGA-D3-A8GN-06A,TCGA-D3-A1Q1-06A,TCGA-FS-A1ZF-06A,TCGA-EB-A4XL-01A,TCGA-EB-A551-01A
ENSG00000107165.11,7.066218,6.799930,3.425791,0.811460,10.288992,14.396544,1.852564,6.336925,0.058561,7.133724,...,0.000000,9.609366,9.958414,11.825504,7.259458,8.247886,0.134844,7.298838,9.170576,0.743548
ENSG00000186847.5,1.408282,12.571299,1.992407,0.742395,0.162815,0.032844,8.230878,1.511560,0.094834,10.378933,...,0.032091,0.926994,1.180265,2.542369,2.160705,0.000000,0.096080,0.353974,12.194867,2.524297
ENSG00000211598.2,2.712010,0.180038,0.496454,4.525218,11.041748,3.805753,6.204527,10.513391,11.780401,5.829377,...,2.267413,1.516865,11.295832,2.619667,2.377095,13.134208,0.663521,0.418840,5.049529,10.846007
ENSG00000211896.5,3.982768,0.445571,2.964609,7.451168,13.441734,5.499959,7.532357,12.551222,12.884868,8.850685,...,4.373067,4.040132,11.879925,4.688568,3.560111,14.899345,2.468558,0.884810,9.064013,13.119988
ENSG00000185664.13,14.729082,8.590050,12.376209,14.049126,14.049643,14.760182,12.729566,9.916614,6.750454,12.760053,...,1.597135,12.983562,14.331738,11.955349,13.738047,11.020641,13.626196,11.649660,12.046421,2.091082
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ENSG00000124785.7,5.565315,6.551427,2.486387,5.934527,2.555172,6.978734,5.753652,7.334014,5.969634,6.580273,...,6.970748,2.243745,3.059803,6.360357,4.538002,2.077676,1.824378,2.473530,2.513193,3.297163
ENSG00000133063.14,0.406450,2.905731,0.089934,6.436339,4.112601,1.807551,0.920347,1.682338,1.855875,2.993947,...,0.332111,0.401105,0.690886,0.325771,1.369335,1.262195,0.148450,0.351033,0.320049,0.706487
ENSG00000134363.10,0.145230,4.813498,1.468679,0.520335,2.021298,1.018676,4.751289,2.647154,5.988663,1.704464,...,2.977774,2.485062,3.048075,2.025180,0.430671,3.570706,0.318345,4.216814,3.678826,1.034542
ENSG00000117122.12,4.201856,6.888783,5.401808,5.852151,5.308334,5.518212,5.513435,4.635008,3.865416,3.385027,...,5.987092,6.183950,6.809981,0.838038,6.034502,3.073416,6.444495,5.801871,4.981610,2.546539
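The same selection can be written with `Series.nlargest`, which avoids explicitly sorting the whole variance vector. This is a sketch on a made-up matrix; `toy_expr` and `n_top` are illustrative names.

```python
import pandas as pd

# Toy matrix: genes in rows, samples in columns (hypothetical values)
toy_expr = pd.DataFrame(
    {"S1": [0.0, 5.0, 2.0, 1.0],
     "S2": [0.0, 1.0, 2.1, 9.0],
     "S3": [0.0, 3.0, 1.9, 5.0]},
    index=["flat", "medium", "nearly_flat", "high"],
)

n_top = 2
# Per-gene variance across samples (axis=1), then the n_top largest
gene_var = toy_expr.var(axis=1)
top_genes = gene_var.nlargest(n_top).index
toy_top = toy_expr.loc[top_genes]

print(toy_top)
```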


In [14]:
px.histogram(smpCountVarGenes.loc["ENSG00000107165.11"])

In [19]:
geneAnnotations[geneAnnotations["ensembl_gene_id"]== "ENSG00000107165.11"]

ensembl_gene_id,ensembl_gene_id,hgnc_symbol,gene_length
ENSG00000107165.11,ENSG00000107165.11,TYRP1,3741


6) Standardise the data (centre each gene to zero mean and scale it to unit variance).

In [9]:
# This scales each column to have mean=0 and standard deviation=1
SS=StandardScaler()

# Apply scaling
X=pd.DataFrame(SS.fit_transform(smpCountVarGenes.T), columns=smpCountVarGenes.index,index = smpCountVarGenes.columns)

In [11]:
X

Ensembl_ID,ENSG00000107165.11,ENSG00000186847.5,ENSG00000211598.2,ENSG00000211896.5,ENSG00000185664.13,ENSG00000211592.5,ENSG00000205420.9,ENSG00000211663.2,ENSG00000211945.2,ENSG00000143556.7,...,ENSG00000166426.7,ENSG00000112378.11,ENSG00000134184.11,ENSG00000197084.5,ENSG00000173369.14,ENSG00000124785.7,ENSG00000133063.14,ENSG00000134363.10,ENSG00000117122.12,ENSG00000253490.4
TCGA-DA-A1IA-06A,0.390877,-0.398011,-0.955572,-1.200193,1.095044,-1.227931,-0.295801,-1.276684,-1.214068,-0.521601,...,-0.519801,-0.020717,-0.276053,0.118980,-0.589385,0.516042,-0.717069,-1.217006,-0.070111,-0.267009
TCGA-XV-A9VZ-01A,0.328488,2.415898,-1.598577,-2.105253,-0.489208,-2.074441,1.194989,-1.445291,-1.483807,1.210752,...,1.026958,2.154123,-0.924061,1.926546,-2.628643,1.024001,0.571374,1.189813,1.315535,-0.991032
TCGA-BF-A5EP-01A,-0.462055,-0.250768,-1.518222,-1.460709,0.487857,-1.494984,-0.325839,-0.719085,-1.318710,-0.460276,...,0.251203,0.264463,-0.883847,-0.389372,-0.880849,-1.069953,-0.880241,-0.534675,0.548703,0.120848
TCGA-EE-A3AF-06A,-1.074578,-0.565864,-0.495100,-0.312736,0.919574,-0.564975,-0.550987,-0.286662,-0.142520,-0.579182,...,-0.628640,-1.043635,1.706290,-0.326319,0.749161,0.706228,2.391493,-1.023613,0.780945,-1.041245
TCGA-D9-A3Z3-06A,1.145956,-0.711961,1.159802,1.220067,0.919707,1.240283,-0.678486,1.533256,1.179922,-0.621162,...,1.411795,-0.605819,1.266803,-0.216464,0.516152,-1.034521,1.193547,-0.249761,0.500499,0.860594
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TCGA-D3-A8GN-06A,0.667736,-0.753002,1.691191,1.593025,0.138036,1.775906,-0.678486,0.938100,1.993009,-0.443380,...,-0.628640,-1.821943,0.327759,-0.389372,0.374057,-1.280485,-0.275910,0.549067,-0.652047,0.779289
TCGA-D3-A1Q1-06A,-1.233105,-0.728783,-1.475795,-1.587633,0.810431,-1.699547,-0.627650,-1.351817,-1.396698,-0.287545,...,2.235320,-2.360811,-0.663122,-0.389372,-0.756512,-1.410962,-0.850074,-1.127753,1.086416,-1.081388
TCGA-FS-A1ZF-06A,0.445379,-0.663775,-1.537933,-1.992866,0.300362,-1.980321,-0.495318,-1.363151,-1.483807,-0.462371,...,-0.283778,-0.809421,-0.601248,-0.389372,-0.790863,-1.076576,-0.745638,0.882181,0.755015,1.239738
TCGA-EB-A4XL-01A,0.883917,2.321009,-0.361949,0.099942,0.402751,0.102306,2.379366,-0.168708,-0.175530,2.425931,...,0.311045,2.110205,-0.682683,3.616289,-0.753031,-1.056145,-0.761611,0.604810,0.332007,0.002013
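A quick check of what standardisation does, sketched on random numbers rather than the cohort data. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas `DataFrame.std` defaults to `ddof=1`, so the check passes `ddof=0` explicitly.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy table: 20 samples in rows, 3 genes in columns (hypothetical values)
toy = pd.DataFrame(rng.normal(loc=5.0, scale=2.0, size=(20, 3)),
                   columns=["g1", "g2", "g3"])

scaled = pd.DataFrame(StandardScaler().fit_transform(toy),
                      columns=toy.columns, index=toy.index)

# Each column should now have mean ~0 and population standard deviation ~1
print(scaled.mean().round(6).tolist())
print(scaled.std(ddof=0).round(6).tolist())
```

Without this step, genes with a large dynamic range would dominate the principal components simply because of their scale.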


## Question 2: Principal component analysis (PCA)

Perform a principal component analysis (PCA) to reduce the dimension of the dataset.

1) Compute 30 principal components (an arbitrary choice; 10 to 50 components are typically computed).



In [10]:
nPCs = 30
# PCA
pca = PCA(n_components=nPCs)
X_pca = pca.fit_transform(X)

# Convert to data frame
principal_df = pd.DataFrame(data = X_pca)
principal_df.index = X.index
principal_df.head()



Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
TCGA-DA-A1IA-06A,-13.201208,-5.332833,-7.120047,-1.688675,0.106868,-6.870514,3.645978,-0.710316,3.229174,1.46935,...,-2.132045,1.529367,-1.497709,-1.505799,2.972959,0.779065,-3.092757,1.348923,0.812005,-0.161878
TCGA-XV-A9VZ-01A,-17.009087,8.746525,2.600618,9.461233,-1.790241,6.297873,3.81605,-3.001654,-3.127924,3.725227,...,0.009316,0.462878,1.042277,-1.447659,-0.367636,-2.307253,-1.015497,-0.346442,2.988078,-2.383099
TCGA-BF-A5EP-01A,-12.765447,-7.444606,-1.124032,0.078364,3.156299,-6.103066,-1.354943,-1.482552,-0.758679,-1.916549,...,1.152907,0.045436,-1.174033,-1.366333,-0.005596,-1.386545,0.412377,-2.804726,1.160685,2.639679
TCGA-EE-A3AF-06A,-2.347926,-2.942148,-7.631831,1.997052,5.818364,0.088144,-0.526447,-1.693006,-0.760237,-2.52918,...,-2.875763,-0.899686,-1.876233,0.013396,-0.390106,-0.087148,-2.569995,0.761482,-2.073177,0.889119
TCGA-D9-A3Z3-06A,14.763371,2.475197,-6.87745,-2.667754,-4.05538,-8.30718,2.777289,1.453846,-4.126435,1.171977,...,1.88009,-1.741279,0.205473,2.062424,-0.782809,-0.323229,-0.482702,-1.582687,0.985227,1.848135
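PCA is a linear method: the scores are the centred data projected onto the right singular vectors. The sketch below checks this on random data by comparing scikit-learn with a plain NumPy SVD. The sign of each component is arbitrary, so absolute values are compared. All names and values here are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
data = rng.normal(size=(50, 6))          # 50 samples, 6 features (toy values)

# Reference: scikit-learn PCA with 2 components
scores_sklearn = PCA(n_components=2).fit_transform(data)

# Same thing by hand: centre, take the SVD, project onto the first 2 directions
centred = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centred, full_matrices=False)
scores_svd = centred @ Vt[:2].T

# Proportion of variance carried by each direction, from the singular values
var_ratio = S**2 / np.sum(S**2)

print(np.allclose(np.abs(scores_sklearn), np.abs(scores_svd), atol=1e-6))
```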


2) Plot the proportion of variance explained by each computed component (elbow plot). What do you notice?

In [None]:
xi = np.arange(1, 1+nPCs, step=1)
yi = pca.explained_variance_ratio_

df = pd.DataFrame(list(zip(xi, yi)),
               columns =['component', 'propVarExp'])

px.scatter(df, x="component", y="propVarExp")
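When reading the elbow plot, it can also help to look at the cumulative proportion of variance. A sketch on random data, where `threshold` is an arbitrary illustrative value and not a recommendation:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
toy = rng.normal(size=(40, 8))           # toy data: 40 samples, 8 features

pca_toy = PCA(n_components=8).fit(toy)
ratios = pca_toy.explained_variance_ratio_

# Running total of explained variance, component by component
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative variance reaches the threshold
threshold = 0.8                          # arbitrary, for illustration only
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print(n_needed, cumulative.round(3))
```

In the exercise, the same `np.cumsum` call could be applied to `pca.explained_variance_ratio_`; with only 30 of the 500 possible components computed, the running total will stay below 1.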

3) Plot the patients in the space formed by the first two components, then by components 2 and 3.

You can colour the points by sample type and by tumour location.

In [None]:
principal_df2 = pd.concat([principal_df,clinicalData],axis = 1)
fig = px.scatter(principal_df2, x=0, y=1, color="sample_type.samples")
fig.show()

In [None]:
principal_df2 = pd.concat([principal_df,clinicalData],axis = 1)
fig = px.scatter(principal_df2, x=1, y=2, color="sample_type.samples")
fig.show()

In [None]:
principal_df2 = pd.concat([principal_df,clinicalData],axis = 1)
fig = px.scatter(principal_df2, x=0, y=1, color="submitted_tumor_location")
fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=1.02,
    xanchor="right",
    x=1
))
fig.show()

## Question 3: t-SNE and UMAP from the computed PCA components

The t-SNE and UMAP algorithms are two non-linear dimensionality reduction methods commonly used to visualise the PCA components in 2 or 3 dimensions.

1) Run a t-SNE with 2 dimensions on the first 10 principal components. The choice of 10 components is based on the drop in variance seen in the elbow plot, but it remains largely arbitrary.

In [None]:
nPCsecRed = 10
# t-SNE
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(principal_df.iloc[:, :nPCsecRed])  # first nPCsecRed (10) components

# Convert to data frame
tsne_df = pd.DataFrame(data = X_tsne, columns = ['tsne comp. 1', 'tsne comp. 2'])
tsne_df.index = X.index
tsne_df
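t-SNE is stochastic, so two runs can give different layouts. The sketch below, on random data, fixes `random_state` so that rerunning the cell in the same environment should give the same layout, and sets `perplexity` explicitly (it must be smaller than the number of samples). The values chosen are illustrative assumptions, not tuned settings.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(3)
toy_pcs = rng.normal(size=(60, 10))      # stand-in for 60 samples x 10 PCs

# perplexity roughly sets the neighbourhood size; random_state fixes the seed
tsne_toy = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
embedding = tsne_toy.fit_transform(toy_pcs)

print(embedding.shape)
```

Distances between well-separated groups in a t-SNE plot are not directly interpretable, so it is worth comparing a few perplexity values before drawing conclusions.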

2) As before, plot the patients in this space and colour the points by sample type and by tumour location.

In [None]:
tsne_df2 = pd.concat([tsne_df,clinicalData],axis = 1)
fig = px.scatter(tsne_df2, x="tsne comp. 1", y="tsne comp. 2", color="sample_type.samples")
fig.show()

In [None]:
fig = px.scatter(tsne_df2, x="tsne comp. 1", y="tsne comp. 2", color="submitted_tumor_location")
fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=1.02,
    xanchor="right",
    x=1
))
fig.show()

3) Now run a UMAP with 2 dimensions on the first 10 principal components.

In [None]:
# UMAP on the first nPCsecRed (10) principal components, as for t-SNE
um = umap.UMAP(n_components=2)
X_umap = um.fit_transform(principal_df.iloc[:, :nPCsecRed])

# Convert to data frame, indexed by sample ID like principal_df
umap_df = pd.DataFrame(data = X_umap, columns = ['umap comp. 1', 'umap comp. 2'])
umap_df.index = principal_df.index

# Shape and preview
print(umap_df.shape)
umap_df


4) Plot the patients in this space and colour the points by sample type and by tumour location.

In [None]:
umap_df2 = pd.concat([umap_df,clinicalData],axis = 1)
fig = px.scatter(umap_df2, x="umap comp. 1", y="umap comp. 2",color="sample_type.samples")
fig.show()

In [None]:
fig = px.scatter(umap_df2, x="umap comp. 1", y="umap comp. 2",color="submitted_tumor_location")
fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=1.02,
    xanchor="right",
    x=1
))
fig.show()

5) Colour the patients by the expression of genes found to be differentially expressed between sample types during the third session.

In [None]:
umap_df2["ENSG00000211899.6"] = smpCounts.loc["ENSG00000211899.6"] # IGHM
fig = px.scatter(umap_df2, x="umap comp. 1", y="umap comp. 2",color="ENSG00000211899.6")
fig.show()

In [None]:
umap_df2["ENSG00000092295.10"] = smpCounts.loc["ENSG00000092295.10"] # TGM1
fig = px.scatter(umap_df2, x="umap comp. 1", y="umap comp. 2",color="ENSG00000092295.10")
fig.show()
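Rather than hard-coding Ensembl IDs, the annotation table can be used to look them up by gene symbol. A sketch with a toy annotation table shaped like `geneAnnotations`; `ensembl_id_for` is a hypothetical helper, and the gene lengths are invented.

```python
import pandas as pd

# Toy annotation table with the same columns as geneAnnotations
# (IDs and symbols from the cells above; gene_length values are made up)
toy_annot = pd.DataFrame({
    "ensembl_gene_id": ["ENSG00000211899.6", "ENSG00000092295.10"],
    "hgnc_symbol": ["IGHM", "TGM1"],
    "gene_length": [1000, 2000],
}).set_index("ensembl_gene_id", drop=False)

def ensembl_id_for(symbol, annotations):
    """Return the Ensembl ID for a gene symbol; fail loudly if absent or ambiguous."""
    hits = annotations.loc[annotations["hgnc_symbol"] == symbol, "ensembl_gene_id"]
    if len(hits) != 1:
        raise ValueError(f"{symbol}: expected exactly 1 match, found {len(hits)}")
    return hits.iloc[0]

print(ensembl_id_for("IGHM", toy_annot))
```

The returned ID could then be passed to `smpCounts.loc[...]` as in the cells above, and the symbol used as a more readable column name for the colour scale.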