#Practical 1: Introduction to Python for biological data analysis

This practical introduces some of the tools available in Python for data analysis, with a focus on biological data.

The course aims to familiarize students with methods and procedures used in scientific research. For that reason, each practical replicates some of the procedures carried out in selected articles, and this is the first of them.

Today we start with data manipulation using the Pandas library, in particular single-cell data loaded with the Scanpy library, together with FlyBaseDownloads, an in-house library for downloading data from FlyBase.

Although two of these libraries are specific to their respective fields, the Pandas operations used here are standard in most machine learning work that relies on the library.

In [None]:
# Install libraries
!pip install scanpy  # Single-cell analysis in Python
!pip install FlyBaseDownloads  # In-house library; downloads FlyBase data directly from Python

In [None]:
# Import the required libraries
from google.colab import drive  # Mount Google Drive in Colab
import scanpy as sc
import numpy as np
import pandas as pd
import FlyBaseDownloads as FBD

In [None]:
drive.mount('/content/drive')  # Gives access to your Drive files

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).



##Article: FlyCellAtlas
##doi: https://doi.org/10.1126/science.abk2432

Single-nucleus sequencing of different tissues in Drosophila melanogaster. Unlike previous work, the FCA aims to build a data resource from flies that share the same genetic background, treatment protocols, and sequencing platform.
It was a joint effort of 40 laboratories. 15 tissues were dissected, and whole bodies and whole heads were also sequenced.


In [None]:
# First activity: load the data downloaded from https://flycellatlas.org/

# Fat body data, SmartSeq2
fatbody_r =  '/content/drive/MyDrive/Curso/Practico 1|2/Datos/fatbody.h5ad'
fatbody_ds = sc.read_h5ad(fatbody_r)

In [None]:
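# 2. Metadata
# Note: .head is printed without parentheses, so the output is the bound method,
# whose repr includes the whole (truncated) table; use .head() for the first five rows only
print(fatbody_ds.obs.head)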

<bound method NDFrame.head of                               sex           transf_annotation
FCA_P31_male_fatbody_A12     male              adult fat body
FCA_P31_male_fatbody_A13     male              adult fat body
FCA_P31_male_fatbody_A14     male              adult fat body
FCA_P31_male_fatbody_A15     male                 muscle cell
FCA_P31_male_fatbody_A16     male  female reproductive system
...                           ...                         ...
FCA_P33_female_fatbody_P5  female              adult fat body
FCA_P33_female_fatbody_P6  female              adult fat body
FCA_P33_female_fatbody_P7  female                     unknown
FCA_P33_female_fatbody_P8  female              adult fat body
FCA_P33_female_fatbody_P9  female  female reproductive system

[1349 rows x 2 columns]>
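
The output above is the repr of a bound method because `head` was printed without being called. A minimal sketch of the difference, using a small made-up metadata table (the cell names and annotations are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy metadata table; names and values are made up for illustration
obs = pd.DataFrame(
    {
        "sex": ["male", "male", "female", "female", "female", "male"],
        "annotation": ["fat body", "muscle", "fat body", "unknown", "fat body", "fat body"],
    },
    index=[f"cell_{i}" for i in range(6)],
)

first_rows = obs.head()                  # calling the method returns the first 5 rows
print(first_rows)
print(obs.shape)                         # (rows, columns) without printing the table
sex_counts = obs["sex"].value_counts()   # number of cells per category
print(sex_counts)
```

`value_counts()` is a quick way to check how many cells fall in each metadata category before splitting the data.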


In [None]:
# 3. Get the cells x genes matrix as a DataFrame
fatbody_df = fatbody_ds.to_df()
print(fatbody_df.head)

<bound method NDFrame.head of                            7SLRNA:CR32864          a     abd-A      Abd-B  \
FCA_P31_male_fatbody_A12              0.0   0.000000  0.000000   0.000000   
FCA_P31_male_fatbody_A13              0.0   0.000000  0.000000   0.000000   
FCA_P31_male_fatbody_A14              0.0   0.000000  0.000000   0.000000   
FCA_P31_male_fatbody_A15              0.0   0.000000  4.919354  10.302004   
FCA_P31_male_fatbody_A16              0.0   0.000000  4.699677   5.671648   
...                                   ...        ...       ...        ...   
FCA_P33_female_fatbody_P5             0.0   7.251909  0.000000   0.000000   
FCA_P33_female_fatbody_P6             0.0  11.497359  0.000000   0.000000   
FCA_P33_female_fatbody_P7             0.0   3.671775  0.000000   0.000000   
FCA_P33_female_fatbody_P8             0.0   0.000000  0.000000   0.000000   
FCA_P33_female_fatbody_P9             0.0   0.000000  0.000000   0.000000   

                                 Abl       ab

In [None]:
# Number of genes
print('Number of genes available: ', len(fatbody_df.columns))

# Number of cells
print('Number of cells: ', len(fatbody_df.index))

Number of genes available:  16311
Number of cells:  1349


In [None]:
# Split by sex
# According to the metadata stored in the file

muestras = fatbody_ds.obs.index
sexos = fatbody_ds.obs['sex'].to_numpy()
dict_bysex = {}

# Iterate over both sequences and append each sample to the list of its sex
for clave, valor in zip(sexos, muestras):
    dict_bysex.setdefault(clave, []).append(valor)

# The same pattern would work for a dictionary keyed by the class of each sample

# The result is a dictionary in which each key maps to a list of samples
print(dict_bysex.keys())

dict_keys(['male', 'female'])
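
The same grouping can also be obtained from Pandas itself. The following is a sketch on a made-up metadata table (cell names are hypothetical), using `groupby(...).groups` and, alternatively, a boolean mask:

```python
import pandas as pd

# Toy metadata; the cell names are made up for illustration
obs = pd.DataFrame(
    {"sex": ["male", "female", "male", "female"]},
    index=["cell_a", "cell_b", "cell_c", "cell_d"],
)

# groupby(...).groups maps each category to the index labels that belong to it
groups = obs.groupby("sex").groups
dict_bysex = {sex: list(labels) for sex, labels in groups.items()}
print(dict_bysex)

# A boolean mask selects the same rows without building a dictionary first
male_cells = obs.index[obs["sex"] == "male"]
print(list(male_cells))
```

The explicit loop is useful for seeing what happens step by step; the Pandas versions are shorter and avoid iterating in Python.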


In [None]:
# DataFrames by sex
fatbody_male = fatbody_df.loc[dict_bysex['male']]
fatbody_female = fatbody_df.loc[dict_bysex['female']]
#fatbody_female.head

In [None]:
# Normalize the data with Min-Max scaling.
# Based on what you saw in the lecture, implement the Min-Max normalization function
# Own implementation
def normalize_column(column):
    # Normalize the column with the Min-Max formula
    normalized_column = (column - column.min()) / (column.max() - column.min())
    return normalized_column

fatbody_df_norm_01 = fatbody_df.apply(normalize_column)  # Pandas applies the function to each column
# print(fatbody_df_norm_01)
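
One detail worth checking in a Min-Max implementation is what happens when a column is constant, for example a gene with zero counts in every cell: the denominator is 0 and the result is NaN. The sketch below shows one possible guard on made-up data; mapping constant columns to 0 is an assumption, not the only reasonable choice.

```python
import pandas as pd

def min_max(column: pd.Series) -> pd.Series:
    """Scale a column to [0, 1]; constant columns are mapped to 0 instead of NaN."""
    value_range = column.max() - column.min()
    if value_range == 0:
        # Assumption: a constant column carries no information, so return zeros
        return pd.Series(0.0, index=column.index)
    return (column - column.min()) / value_range

# Toy expression table: gene_b is zero in every cell
toy = pd.DataFrame({"gene_a": [0.0, 2.0, 4.0], "gene_b": [0.0, 0.0, 0.0]})

naive = (toy - toy.min()) / (toy.max() - toy.min())  # unguarded formula
guarded = toy.apply(min_max)

print(naive)    # gene_b is NaN (0 / 0)
print(guarded)  # gene_b is 0.0
```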


In [None]:
# Normalize the data with Min-Max scaling, this time using scikit-learn
from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler with the desired range
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit the scaler to the DataFrame
scaler.fit(fatbody_df)

# Normalize the DataFrame (index= keeps the cell names, which transform() does not return)
fatbody_df_norm_01_sk = pd.DataFrame(scaler.transform(fatbody_df), columns=fatbody_df.columns, index=fatbody_df.index)

# Are there differences? Why?
print(list(fatbody_df_norm_01_sk.iloc[0, :]))
print(list(fatbody_df_norm_01.iloc[0, :]))

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3178495466709137, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.47258615493774414, 0.5106295347213745, 0.5450917482376099, 0.0, 0.0, 0.0, 0.6075025796890259, 0.0, 0.0, 0.189059779047966, 0.0, 0.0, 0.3128049373626709, 0.0, 0.0, 0.0, 0.24323830008506775, 0.49299725890159607, 0.0, 0.0, 0.7216368913650513, 0.0, 0.0, 0.7563135027885437, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.41033169627189636, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.179225891828537, 0.0, 0.0, 0.0, 0.0, 0.15470494329929352, 0.0, 0.0, 0.0, 0.0, 0.0, 0.307686984539032, 0.0, 0.0, 0.29081854224205017, 0.0, 0.4049190580844879, 0.0, 0.2667633891105652, 0.5358862280845642, 0.0, 0.306000679731369, 0.45025917887687683, 0.0, 0.4362989366054535, 0.3327025771141052, 0.42544320225715637, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5167677402496338, 0.0, 0.6912381052970886, 0.251437783241272, 0.0, 0.0, 0.3555324077606201, 0.30905309319496155, 0.0, 0.0, 0.0, 0.0, 0.30088678002357483, 0.0, 0.0, 0.0, 0.0,
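
One way to approach the question in the cell above is to compare both results on a tiny table. The sketch below uses made-up values. It relies on two behaviours of scikit-learn: `MinMaxScaler` treats a zero range as 1, so constant columns become 0, and `fit_transform` returns a plain NumPy array without row labels.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy expression table: gene_b is constant
toy = pd.DataFrame(
    {"gene_a": [0.0, 2.0, 4.0], "gene_b": [0.0, 0.0, 0.0]},
    index=["cell_1", "cell_2", "cell_3"],
)

manual = (toy - toy.min()) / (toy.max() - toy.min())

scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(toy)
# Pass index= as well, otherwise the cell names are replaced by 0, 1, 2, ...
sk = pd.DataFrame(scaled, columns=toy.columns, index=toy.index)

# Columns where the two results disagree (NaN vs. 0.0 counts as a difference)
differs = [col for col in toy.columns if not manual[col].equals(sk[col])]
print(differs)
```

On real data, small floating-point differences between the two approaches are also possible, so `numpy.allclose` may be a better comparison than exact equality.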

In [None]:
# Select protein-coding genes -> integration with other data sources
Unique_protein_isoforms = FBD.Genes.Unique_protein_isoforms()
#print(Unique_protein_isoforms.head)
gene_prot = Unique_protein_isoforms['FB_gene_symbol']

cg_col_df = [col for col in gene_prot if col in fatbody_df.columns]  # genes that are also present in the columns of fatbody_df

fatbody_prot_genes = fatbody_df.loc[:, cg_col_df]  # select those columns (genes) with .loc

print('Total genes: ', len(fatbody_df.columns))
print('Protein-coding genes: ', len(fatbody_prot_genes.columns))

# Why is this result wrong?

Total genes:  16311
Protein-coding genes:  20930
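
A likely reason the count above exceeds the number of columns is that the reference table lists isoforms, so a gene symbol can appear in several rows and `.loc` then returns the same column more than once. A small sketch with made-up symbols (`geneA`, `geneB`, `geneC` and `geneX` are hypothetical):

```python
import pandas as pd

df = pd.DataFrame([[1.0, 2.0, 3.0]], columns=["geneA", "geneB", "geneC"])
# Hypothetical reference column: geneA has two isoforms, geneX is not in df
gene_prot = pd.Series(["geneA", "geneA", "geneB", "geneX"])

repeated = [g for g in gene_prot if g in df.columns]
print(len(df.loc[:, repeated].columns))  # geneA is selected twice

# isin() yields one boolean per column, so duplicates in the reference
# cannot inflate the result and the original column order is preserved
unique_cols = df.columns[df.columns.isin(gene_prot)]
print(list(unique_cols))
```

Compared with a set intersection, `isin` keeps the columns in their original order, which makes results reproducible between runs.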


In [None]:
cg_col_df = list(set(gene_prot) & set(fatbody_df.columns))  # Select columns through the intersection of two sets of unique symbols

fatbody_prot_genes = fatbody_df.loc[:, cg_col_df]
print('Total genes: ', len(fatbody_df.columns))
print('Protein-coding genes: ', len(fatbody_prot_genes.columns))

Total genes:  16311
Protein-coding genes:  12825


In [None]:
# Select genes that do NOT code for proteins

Noncoding_RNAs = FBD.Genes.Noncoding_RNAs()
noncoding = Noncoding_RNAs['symbol'].tolist()
print(noncoding[:10])

nc_col_df = list(set(noncoding) & set(fatbody_df.columns))
print(nc_col_df)

['tRNA:Pro-CGG-1-1-RA', 'snoRNA:M-RA', 'lncRNA:CR33218-RC', 'lncRNA:CR33218-RD', 'tRNA:Gln-TTG-2-1-RA', 'tRNA:Gln-CTG-2-1-RA', 'tRNA:Pro-CGG-3-1-RA', 'lncRNA:roX1-RA', 'lncRNA:roX1-RB', 'lncRNA:roX1-RC']
[]


In [None]:
# Additional 'pre-processing' step: how my data are named vs. how the other sources name them
noncoding_filt = [gene.split('-R')[0] for gene in noncoding]
print(noncoding_filt[:10])
nc_col_df =list(set(noncoding_filt) & set(fatbody_df.columns))
print(nc_col_df)
print(len(nc_col_df))

['tRNA:Pro-CGG-1-1', 'snoRNA:M', 'lncRNA:CR33218', 'lncRNA:CR33218', 'tRNA:Gln-TTG-2-1', 'tRNA:Gln-CTG-2-1', 'tRNA:Pro-CGG-3-1', 'lncRNA:roX1', 'lncRNA:roX1', 'lncRNA:roX1']
['lncRNA:CR45054', 'asRNA:CR44293', 'lncRNA:CR45165', 'lncRNA:CR44301', 'lncRNA:CR45509', 'lncRNA:CR45037', 'lncRNA:CR45762', 'lncRNA:CR43357', 'lncRNA:CR42874', 'lncRNA:CR44058', 'snoRNA:Psi28S-1837b', 'lncRNA:CR43848', 'lncRNA:CR34024', 'asRNA:CR44653', 'lncRNA:CR45559', 'lncRNA:CR44496', 'lncRNA:CR44534', 'asRNA:CR44992', 'lncRNA:CR43363', 'lncRNA:CR46010', 'lncRNA:CR43949', 'snoRNA:Psi28S-1153', 'lncRNA:CR44156', 'lncRNA:CR44491', 'lncRNA:CR44292', 'asRNA:CR45175', 'lncRNA:CR46014', 'asRNA:CR46136', 'lncRNA:CR45532', 'lncRNA:CR45029', 'asRNA:CR45895', 'lncRNA:CR43713', 'lncRNA:CR44923', 'asRNA:CR43877', 'lncRNA:CR45169', 'snoRNA:Or-CD14', 'asRNA:CR45738', 'snoRNA:Me28S-A2589a', 'snoRNA:Psi28S-2626', 'lncRNA:CR44577', 'lncRNA:CR44953', 'lncRNA:CR44950', 'lncRNA:CR44309', 'snoRNA:Psi18S-525a', 'asRNA:CR45026', 'l
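
Splitting on the first `-R` works for the symbols shown above, but it would truncate any symbol that contains `-R` before the transcript suffix. A sketch of a stricter alternative that removes only a trailing suffix; `gene-Rx-RA` is a hypothetical symbol invented to show the difference:

```python
import re

# Transcript symbols as printed above, plus one hypothetical symbol ("gene-Rx-RA")
# chosen to contain "-R" before the transcript suffix
transcripts = ["tRNA:Pro-CGG-1-1-RA", "lncRNA:roX1-RB", "gene-Rx-RA"]

by_split = [t.split("-R")[0] for t in transcripts]

# Remove only a trailing "-R" followed by capital letters (RA, RB, ..., RAA)
suffix = re.compile(r"-R[A-Z]+$")
by_regex = [suffix.sub("", t) for t in transcripts]

print(by_split)
print(by_regex)
```

Whether such symbols actually occur in the FlyBase table is not checked here; the point is only that anchoring the pattern at the end of the string makes the assumption explicit.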

In [None]:
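# Genes identified neither as protein-coding nor as non-coding
genes_no_ident = nc_col_df + cg_col_df  # new list, so nc_col_df is not modified in place
df_no_ident = fatbody_df.loc[:, ~fatbody_df.columns.isin(genes_no_ident)]

# Count the unidentified genes whose symbol starts with 'C'
aux = [gene for gene in df_no_ident.columns if gene.startswith('C')]

print(len(aux))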
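
When combining gene lists, keep in mind that assigning a list to a new name does not copy it, so calling `extend` on the new name also changes the original list. A minimal illustration with made-up gene names:

```python
noncoding = ["geneA", "geneB"]
coding = ["geneC"]

alias = noncoding              # same list object, not a copy
alias.extend(coding)
grew_in_place = noncoding == ["geneA", "geneB", "geneC"]
print(grew_in_place)           # the original list was modified too

noncoding = ["geneA", "geneB"]
combined = noncoding + coding  # concatenation builds a new list
print(noncoding, combined)
```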

In [None]:
df_no_ident

Unnamed: 0,Argk,Bsg25A,Bsg25D,Dl,Eip71CD,Eip55E,e(y)1,CG17716,h,Hlc,...,CR46418,CR46419,CR46420,CR46421,CR46422,CR46423,CR46424,CR46425,CR46432,CR46447
FCA_P31_male_fatbody_A12,0.000000,0.0,4.301766,0.000000,0.00000,6.6433,0.0,0.000000,6.026606,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
FCA_P31_male_fatbody_A13,0.000000,0.0,10.671881,1.678184,0.00000,0.0000,0.0,0.000000,7.858144,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
FCA_P31_male_fatbody_A14,0.000000,0.0,0.000000,0.000000,0.00000,0.0000,0.0,0.000000,6.862206,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
FCA_P31_male_fatbody_A15,10.522108,0.0,0.000000,7.409896,0.00000,0.0000,0.0,0.000000,4.109003,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
FCA_P31_male_fatbody_A16,0.000000,0.0,0.000000,4.699677,0.00000,0.0000,0.0,7.650262,8.646667,4.699677,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
FCA_P33_female_fatbody_P5,0.000000,0.0,0.000000,0.000000,0.00000,0.0000,0.0,2.332813,4.100260,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
FCA_P33_female_fatbody_P6,0.000000,0.0,0.000000,0.000000,7.48811,0.0000,0.0,3.603875,7.014813,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
FCA_P33_female_fatbody_P7,5.179223,0.0,0.000000,0.000000,0.00000,0.0000,0.0,4.614033,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
FCA_P33_female_fatbody_P8,0.000000,0.0,0.000000,0.000000,0.00000,0.0000,0.0,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# To try to recover more genes, compare the symbols in lowercase, since annotations may differ in case
# Select protein-coding genes -> integration with other data sources
#print(Unique_protein_isoforms.head)

gene_prot = Unique_protein_isoforms['FB_gene_symbol']

# Convert both collections to lowercase sets
set_1 = set(str(gene).lower() for gene in gene_prot)
set_2 = set(str(gene).lower() for gene in fatbody_df.columns)

# Intersection with the set method
cg_col_set = set_1.intersection(set_2)

# Recover the original symbols to select columns of the df
cg_col_df = [elemento for elemento in fatbody_df if elemento.lower() in cg_col_set]

fatbody_prot_genes = fatbody_df.loc[:, cg_col_df]
print('Total genes: ', len(fatbody_df.columns))
print('Protein-coding genes: ', len(fatbody_prot_genes.columns))

Total genes:  16311
Protein-coding genes:  12850
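
Lowercasing should be applied with care, because Drosophila gene symbols are case-sensitive: for example, `Dl` (Delta) and `dl` (dorsal) are different genes. The sketch below uses toy column names and a made-up reference list to flag columns that collide after lowercasing, so they can be checked by hand:

```python
from collections import defaultdict

columns = ["Dl", "dl", "Abd-B", "CG17716"]   # toy column names
reference = ["dl", "abd-b"]                  # made-up reference symbols

# Map each lowercase symbol to every original column that produces it
by_lower = defaultdict(list)
for col in columns:
    by_lower[col.lower()].append(col)

reference_lower = {sym.lower() for sym in reference}
matched = [col for col in columns if col.lower() in reference_lower]
ambiguous = {low: cols for low, cols in by_lower.items() if len(cols) > 1}

print(matched)    # 'Dl' is matched even though only 'dl' is in the reference
print(ambiguous)  # lowercase symbols shared by more than one column
```

A conservative option is to accept exact matches first and use the lowercase comparison only for symbols that remain unmatched and are not ambiguous.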
