# Modules

In Python, every script or source file is called a module. Modules, in turn, can be part of packages. A package is a folder that contains `.py` files. For example, if we saved the function that extracts codons from a string in a file named `get_codons.py`, that file would be a Python script. If the script sits in the same folder as the notebook, I can import it like this:

In [None]:
import get_codons
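The codon function itself is not shown in this section, so the following is only a sketch of what a file such as `get_codons.py` might contain. The function name `get_codons` and the choice to drop trailing nucleotides that do not complete a codon are assumptions made for illustration.

```python
# Hypothetical contents of get_codons.py (the original function is not shown here).
def get_codons(sequence):
    """Split a nucleotide string into consecutive 3-letter codons."""
    # Assumption: leftover nucleotides that do not fill a whole codon are dropped.
    usable_length = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable_length, 3)]


codons = get_codons("ATGGCCTAAG")
print(codons)
```

With a file like this next to the notebook, `import get_codons` would expose the function as `get_codons.get_codons(...)`.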

That lets me reuse my code. Fortunately, Python ships with built-in modules (its standard library). We can also install new packages that contain modules with `pip`, the official Python installer, or with `conda`, the package manager from Anaconda Inc.
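Before importing a third-party module, it can be handy to check whether it is installed at all. The sketch below uses the standard library's `importlib.util.find_spec`; the helper name `is_installed` is an assumption introduced for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


# calendar is part of the standard library, so it should always be found.
print(is_installed("calendar"))
# pandas is only found if it was installed beforehand (for example with pip or conda).
print(is_installed("pandas"))
```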

In [1]:
import calendar

In [5]:
print(calendar.month(2020, 7))

     July 2020
Mo Tu We Th Fr Sa Su
       1  2  3  4  5
 6  7  8  9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31



It is also possible to abbreviate namespaces with an alias. To do so, write the keyword `as` in the import statement, followed by the alias we will use from then on to refer to the imported namespace:

- `import modulo`
- `import modulo as m`
- `import paquete.modulo1 as pm`
- `import paquete.subpaquete.modulo1 as psm`
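The dotted forms in the list can be tried with standard-library packages. This is a small sketch; the aliases `up` and `ET` are just conventional choices, not names taken from this notebook.

```python
import urllib.parse as up              # package.module form
import xml.etree.ElementTree as ET     # package.subpackage.module form

# The alias gives access to everything defined in the imported module.
parsed = up.urlparse("https://data-unix.s3-us-west-1.amazonaws.com/IGC.annotation.tsv.gz")
print(parsed.netloc)

root = ET.fromstring("<genes><gene id='1'/></genes>")
print(root.tag)
```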

In [7]:
import calendar

In [None]:
print(calendar.month(2020, 7))

In [9]:
from calendar import month

In [10]:
print(month(2020, 7))

     July 2020
Mo Tu We Th Fr Sa Su
       1  2  3  4  5
 6  7  8  9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31



In [11]:
import calendar as c

In [12]:
print(c.month(2020, 7))

     July 2020
Mo Tu We Th Fr Sa Su
       1  2  3  4  5
 6  7  8  9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31



# Pandas

<center><img src="imgs/pandas.png" width=700 height=700/></center>

In [2]:
import pandas as pd

In [11]:
# import atgenomics as atg

In [3]:
help(pd.read_excel)

Help on function read_excel in module pandas.io.excel:

read_excel(io, sheetname=0, header=0, skiprows=None, skip_footer=0, index_col=None, names=None, parse_cols=None, parse_dates=False, date_parser=None, na_values=None, thousands=None, convert_float=True, has_index_names=None, converters=None, true_values=None, false_values=None, engine=None, squeeze=False, **kwds)
    Read an Excel table into a pandas DataFrame
    
    Parameters
    ----------
    io : string, path object (pathlib.Path or py._path.local.LocalPath),
        file-like object, pandas ExcelFile, or xlrd workbook.
        The string could be a URL. Valid URL schemes include http, ftp, s3,
        and file. For file URLs, a host is expected. For instance, a local
        file could be file://localhost/path/to/workbook.xlsx
    sheetname : string, int, mixed list of strings/ints, or None, default 0
    
        Strings are used for sheet names, Integers are used in zero-indexed
        sheet positions.
    
        Lists

In [4]:
help(pd.read_table)

Help on function read_table in module pandas.io.parsers:

read_table(filepath_or_buffer, sep='\t', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precisi

In [5]:
help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=No

In [6]:
df = pd.read_csv('data/IGC.annotation.tsv.gz', sep='\t')

In [8]:
# For Azure notebooks
! mkdir -p data
! curl https://data-unix.s3-us-west-1.amazonaws.com/IGC.annotation.tsv.gz -o data/IGC.annotation.tsv.gz
df = pd.read_csv("data/IGC.annotation.tsv.gz", sep='\t')

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15.2M  100 15.2M    0     0  2026k      0  0:00:07  0:00:07 --:--:-- 2055k


In [None]:
# For a locally installed Jupyter Notebook
df = pd.read_csv('https://data-unix.s3-us-west-1.amazonaws.com/IGC.annotation.tsv.gz', sep='\t')
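To see what `sep='\t'` does without downloading anything, the same call can be tried on a tiny in-memory table. The rows below are invented for illustration and are not part of the IGC dataset.

```python
import io

import pandas as pd

# Toy tab-separated text; the values are made up for this example.
tsv_text = (
    "Gene ID\tGene Name\tGene Length\n"
    "1\tgene_a\t300\n"
    "2\tgene_b\t450\n"
)

# read_csv accepts file-like objects, so StringIO stands in for a file on disk.
toy = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(toy.shape)
print(toy.columns.tolist())
```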

In [7]:
type(df)

pandas.core.frame.DataFrame

In [8]:
df.head()

Unnamed: 0,Gene ID,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
0,5209933,MH0396_GL0114156,549,Lack both ends,EUR,unknown,unknown,unknown,COG4932,0.008682,0.009346,unknown,Cell wall/membrane/envelope biogenesis,EUR
1,6811315,MH0012_GL0174453,372,Complete,EUR,unknown,unknown,K00335,COG3411,0.187056,0.195327,Energy Metabolism,Energy production and conversion,EUR;CHN;USA
2,7221353,MH0389_GL0170585,330,Lack 5'-end,EUR,unknown,unknown,unknown,unknown,0.015785,0.018692,unknown,unknown,EUR
3,8950791,DOM015_GL0050638,177,Lack 5'-end,CHN,unknown,unknown,unknown,NOG125034,0.008682,0.008411,unknown,Function unknown,CHN
4,2441669,469596.HMPREF9488_01493,984,Complete,SP,Firmicutes,Coprobacillus,unknown,unknown,0.112865,0.113084,unknown,unknown,EUR;CHN;USA


In [9]:
df.tail()

Unnamed: 0,Gene ID,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
499995,278939,765560005-stool1_revised_scaffold5052_1_gene73628,2349,Lack 5'-end,USA,unknown,unknown,K00571,COG1002,0.153907,0.152336,Genetic Information Processing,Defense mechanisms,EUR;USA
499996,691723,638754422-stool2_revised_scaffold23317_1_gene1940,1698,Complete,USA,unknown,unknown,unknown,unknown,0.001579,0.000935,unknown,unknown,USA
499997,4924959,MH0161_GL0035075,579,Complete,EUR,unknown,unknown,unknown,unknown,0.054459,0.064486,unknown,unknown,EUR;CHN
499998,7655040,764042746-stool2_revised_scaffold22088_1_gene3...,288,Complete,USA,unknown,unknown,unknown,unknown,0.003946,0.002804,unknown,unknown,USA
499999,5457503,V1.UC36-0_GL0134515,525,Lack 5'-end,EUR,unknown,unknown,unknown,unknown,0.008682,0.007477,unknown,unknown,EUR


In [10]:
df.shape

(500000, 14)

In [11]:
df.dtypes

Gene ID                                 int64
Gene Name                              object
Gene Length                             int64
Gene Completeness                      object
Cohort Origin                          object
Taxonomic Annotation(Phylum Level)     object
Taxonomic Annotation(Genus Level)      object
KEGG Annotation                        object
eggNOG Annotation                      object
Sample Occurence Frequency            float64
Individual Occurence Frequency        float64
KEGG Functional Categories             object
eggNOG Functional Categories           object
Cohort Assembled                       object
dtype: object

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 14 columns):
Gene ID                               500000 non-null int64
Gene Name                             500000 non-null object
Gene Length                           500000 non-null int64
Gene Completeness                     499941 non-null object
Cohort Origin                         500000 non-null object
Taxonomic Annotation(Phylum Level)    500000 non-null object
Taxonomic Annotation(Genus Level)     500000 non-null object
KEGG Annotation                       500000 non-null object
eggNOG Annotation                     500000 non-null object
Sample Occurence Frequency            500000 non-null float64
Individual Occurence Frequency        500000 non-null float64
KEGG Functional Categories            500000 non-null object
eggNOG Functional Categories          500000 non-null object
Cohort Assembled                      491802 non-null object
dtypes: float64(2), int64(2), objec
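The non-null counts reported by `info()` hint at missing values (for example in `Gene Completeness`). One possible way to count them directly is `isna().sum()`, sketched here on a toy frame with invented values.

```python
import pandas as pd

# Toy data with one missing value; not taken from the IGC table.
toy = pd.DataFrame({
    "Gene Completeness": ["Complete", None, "Lack 5'-end"],
    "Gene Length": [300, 450, 600],
})

# isna() marks missing cells as True; summing counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```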

In [13]:
df.describe()

Unnamed: 0,Gene ID,Gene Length,Sample Occurence Frequency,Individual Occurence Frequency
count,500000.0,500000.0,500000.0,500000.0
mean,4939795.0,753.319838,0.077174,0.076986
std,2851155.0,686.058789,0.139305,0.138812
min,6.0,102.0,0.0,0.0
25%,2471293.0,312.0,0.005525,0.005607
50%,4947035.0,579.0,0.019732,0.020561
75%,7402792.0,975.0,0.077348,0.076636
max,9879844.0,41277.0,0.998421,0.999065


In [14]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Gene ID,500000.0,4939795.0,2851155.0,6.0,2471293.0,4947035.0,7402792.0,9879844.0
Gene Length,500000.0,753.3198,686.0588,102.0,312.0,579.0,975.0,41277.0
Sample Occurence Frequency,500000.0,0.07717352,0.1393053,0.0,0.005524862,0.01973165,0.07734807,0.9984215
Individual Occurence Frequency,500000.0,0.0769858,0.1388122,0.0,0.005607477,0.02056075,0.07663551,0.9990654


In [15]:
description = df.describe().T

In [16]:
description.to_csv('data/description.tsv', sep='\t', index=True)
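A saved summary can be read back with `read_csv`, using `index_col=0` so the row labels written by `index=True` are restored as the index. The sketch below uses a toy frame and a temporary directory (both assumptions for the example) so it does not touch the `data/` folder.

```python
import os
import tempfile

import pandas as pd

# Toy numeric data; the values are made up for this example.
toy = pd.DataFrame({
    "Gene Length": [300, 450, 600],
    "Sample Occurence Frequency": [0.1, 0.2, 0.3],
})
description = toy.describe().T

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "description.tsv")
    # index=True writes the row labels (the original column names) as the first column.
    description.to_csv(path, sep="\t", index=True)
    # index_col=0 turns that first column back into the index.
    restored = pd.read_csv(path, sep="\t", index_col=0)

print(restored.index.tolist())
```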

In [17]:
df.head(100).describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Gene ID,100.0,4975205.0,2903049.0,189878.0,2227776.0,5258160.0,7551792.0,9668785.0
Gene Length,100.0,724.77,547.6866,117.0,297.75,544.5,1035.75,2646.0
Sample Occurence Frequency,100.0,0.1143094,0.1764729,0.000789,0.006314128,0.02683504,0.1168114,0.7884767
Individual Occurence Frequency,100.0,0.1129252,0.1748951,0.0,0.005607477,0.02523364,0.1142523,0.782243


In [18]:
# df.T.describe()

In [19]:
# df2 = df.T

In [20]:
# df2.to_csv('data/dataframe_inver.tsv', sep='\t', index=False)

In [21]:
df.columns

Index(['Gene ID', 'Gene Name', 'Gene Length', 'Gene Completeness',
       'Cohort Origin', 'Taxonomic Annotation(Phylum Level)',
       'Taxonomic Annotation(Genus Level)', 'KEGG Annotation',
       'eggNOG Annotation', 'Sample Occurence Frequency',
       'Individual Occurence Frequency', 'KEGG Functional Categories',
       'eggNOG Functional Categories', 'Cohort Assembled'],
      dtype='object')

In [22]:
type(df.shape)

tuple
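Because `shape` is a plain tuple, it can be unpacked into separate variables. A minimal sketch on a toy frame (the names `n_rows` and `n_columns` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"Gene ID": [1, 2, 3], "Gene Length": [300, 450, 600]})

# shape is (number of rows, number of columns)
n_rows, n_columns = toy.shape
print(n_rows, n_columns)
```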

In [23]:
type(df.dtypes)

pandas.core.series.Series

In [24]:
type(df.columns)

pandas.indexes.base.Index

In [25]:
df['Gene ID']

0         5209933
1         6811315
2         7221353
3         8950791
4         2441669
5         4807882
6          693534
7          647891
8         7928769
9         8920818
10        1821827
11        5707244
12        9553442
13        5831026
14        8222582
15        5870802
16        5203442
17        6226899
18        8094620
19        6068154
20        9139876
21        9031743
22        4137075
23        3650803
24        2312660
25        7219798
26        1293228
27        3320656
28        2811850
29        5500609
           ...   
499970    4238583
499971    6992342
499972     186663
499973    4517482
499974    5233095
499975    3484802
499976    3280488
499977    8177332
499978    9145946
499979    7115654
499980    7451388
499981    8734258
499982    7786169
499983     556756
499984    4147487
499985    9067856
499986    6567495
499987    6439309
499988    5254719
499989    6961318
499990    6294127
499991    9631397
499992    8040629
499993    7363211
499994    

In [26]:
df['Gene ID'].head()

0    5209933
1    6811315
2    7221353
3    8950791
4    2441669
Name: Gene ID, dtype: int64

In [27]:
df[['Gene ID', 'Gene Name']].head()

Unnamed: 0,Gene ID,Gene Name
0,5209933,MH0396_GL0114156
1,6811315,MH0012_GL0174453
2,7221353,MH0389_GL0170585
3,8950791,DOM015_GL0050638
4,2441669,469596.HMPREF9488_01493


In [28]:
df[['Gene ID', 'Gene Name']].head().T

Unnamed: 0,0,1,2,3,4
Gene ID,5209933,6811315,7221353,8950791,2441669
Gene Name,MH0396_GL0114156,MH0012_GL0174453,MH0389_GL0170585,DOM015_GL0050638,469596.HMPREF9488_01493
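The two selection styles used above return different types: single brackets with one label give a `Series`, while a list of labels gives a `DataFrame`, even when the list has a single entry. A short sketch with toy data:

```python
import pandas as pd

toy = pd.DataFrame({"Gene ID": [1, 2], "Gene Name": ["gene_a", "gene_b"]})

one_column = toy["Gene ID"]                   # Series
two_columns = toy[["Gene ID", "Gene Name"]]   # DataFrame
one_column_frame = toy[["Gene ID"]]           # still a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(type(one_column_frame).__name__)
```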


In [29]:
df.index

RangeIndex(start=0, stop=500000, step=1)

In [30]:
type(df.index)

pandas.indexes.range.RangeIndex

In [31]:
help(df.sort_index)

Help on method sort_index in module pandas.core.frame:

sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, by=None) method of pandas.core.frame.DataFrame instance
    Sort object by labels (along an axis)
    
    Parameters
    ----------
    axis : index, columns to direct sorting
    level : int or level name or list of ints or list of level names
        if not None, sort on values in specified index level(s)
    ascending : boolean, default True
        Sort ascending vs. descending
    inplace : bool, default False
        if True, perform operation in-place
    kind : {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
         Choice of sorting algorithm. See also ndarray.np.sort for more
         information.  `mergesort` is the only stable algorithm. For
         DataFrames, this option is only applied when sorting on a single
         column or label.
    na_position : {'first', 'last'}, default 'l

In [32]:
# Sort by rows (axis 0), ascending
df.sort_index(axis=0, ascending=True)

Unnamed: 0,Gene ID,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
0,5209933,MH0396_GL0114156,549,Lack both ends,EUR,unknown,unknown,unknown,COG4932,0.008682,0.009346,unknown,Cell wall/membrane/envelope biogenesis,EUR
1,6811315,MH0012_GL0174453,372,Complete,EUR,unknown,unknown,K00335,COG3411,0.187056,0.195327,Energy Metabolism,Energy production and conversion,EUR;CHN;USA
2,7221353,MH0389_GL0170585,330,Lack 5'-end,EUR,unknown,unknown,unknown,unknown,0.015785,0.018692,unknown,unknown,EUR
3,8950791,DOM015_GL0050638,177,Lack 5'-end,CHN,unknown,unknown,unknown,NOG125034,0.008682,0.008411,unknown,Function unknown,CHN
4,2441669,469596.HMPREF9488_01493,984,Complete,SP,Firmicutes,Coprobacillus,unknown,unknown,0.112865,0.113084,unknown,unknown,EUR;CHN;USA
5,4807882,V1.FI14_GL0101784,594,Complete,EUR,unknown,unknown,unknown,unknown,0.008682,0.009346,unknown,unknown,EUR
6,693534,MH0448_GL0045910,1695,Lack both ends,EUR,unknown,unknown,unknown,NOG12793,0.004736,0.004673,unknown,Function unknown,EUR
7,647891,MH0433_GL0005093,1740,Complete,EUR,unknown,unknown,unknown,unknown,0.003946,0.004673,unknown,unknown,EUR;USA
8,7928769,O2.UC48-1_GL0076618,261,Lack 3'-end,EUR,unknown,unknown,unknown,unknown,0.014996,0.008411,unknown,unknown,EUR
9,8920818,764588959-stool1_revised_C775588_1_gene178391,180,Lack 3'-end,USA,unknown,unknown,unknown,unknown,0.000789,0.000935,unknown,unknown,USA


In [33]:
# Sort by rows (axis 0), descending
df.sort_index(axis=0, ascending=False)

Unnamed: 0,Gene ID,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
499999,5457503,V1.UC36-0_GL0134515,525,Lack 5'-end,EUR,unknown,unknown,unknown,unknown,0.008682,0.007477,unknown,unknown,EUR
499998,7655040,764042746-stool2_revised_scaffold22088_1_gene3...,288,Complete,USA,unknown,unknown,unknown,unknown,0.003946,0.002804,unknown,unknown,USA
499997,4924959,MH0161_GL0035075,579,Complete,EUR,unknown,unknown,unknown,unknown,0.054459,0.064486,unknown,unknown,EUR;CHN
499996,691723,638754422-stool2_revised_scaffold23317_1_gene1940,1698,Complete,USA,unknown,unknown,unknown,unknown,0.001579,0.000935,unknown,unknown,USA
499995,278939,765560005-stool1_revised_scaffold5052_1_gene73628,2349,Lack 5'-end,USA,unknown,unknown,K00571,COG1002,0.153907,0.152336,Genetic Information Processing,Defense mechanisms,EUR;USA
499994,2951767,V1.UC26-4_GL0007902,879,Complete,EUR,unknown,unknown,unknown,NOG47185,0.004736,0.003738,unknown,Function unknown,EUR
499993,7363211,MH0260_GL0023264,315,Lack 3'-end,EUR,unknown,unknown,unknown,unknown,0.003946,0.004673,unknown,unknown,EUR
499992,8040629,159611913-stool1_revised_C786324_1_gene54480,252,Complete,USA,Firmicutes,Faecalibacterium,unknown,unknown,0.018942,0.017757,unknown,unknown,USA
499991,9631397,MH0280_GL0114474,120,Lack 5'-end,EUR,unknown,unknown,unknown,unknown,0.037096,0.033645,unknown,unknown,EUR
499990,6294127,V1.UC26-4_GL0011854,432,Lack 5'-end,EUR,unknown,unknown,K07024,COG0561,0.002368,0.001869,Poorly Characterized,General function prediction only,EUR


In [34]:
# Sort by columns (axis 1), ascending
df.sort_index(axis=1, ascending=True)

Unnamed: 0,Cohort Assembled,Cohort Origin,Gene Completeness,Gene ID,Gene Length,Gene Name,Individual Occurence Frequency,KEGG Annotation,KEGG Functional Categories,Sample Occurence Frequency,Taxonomic Annotation(Genus Level),Taxonomic Annotation(Phylum Level),eggNOG Annotation,eggNOG Functional Categories
0,EUR,EUR,Lack both ends,5209933,549,MH0396_GL0114156,0.009346,unknown,unknown,0.008682,unknown,unknown,COG4932,Cell wall/membrane/envelope biogenesis
1,EUR;CHN;USA,EUR,Complete,6811315,372,MH0012_GL0174453,0.195327,K00335,Energy Metabolism,0.187056,unknown,unknown,COG3411,Energy production and conversion
2,EUR,EUR,Lack 5'-end,7221353,330,MH0389_GL0170585,0.018692,unknown,unknown,0.015785,unknown,unknown,unknown,unknown
3,CHN,CHN,Lack 5'-end,8950791,177,DOM015_GL0050638,0.008411,unknown,unknown,0.008682,unknown,unknown,NOG125034,Function unknown
4,EUR;CHN;USA,SP,Complete,2441669,984,469596.HMPREF9488_01493,0.113084,unknown,unknown,0.112865,Coprobacillus,Firmicutes,unknown,unknown
5,EUR,EUR,Complete,4807882,594,V1.FI14_GL0101784,0.009346,unknown,unknown,0.008682,unknown,unknown,unknown,unknown
6,EUR,EUR,Lack both ends,693534,1695,MH0448_GL0045910,0.004673,unknown,unknown,0.004736,unknown,unknown,NOG12793,Function unknown
7,EUR;USA,EUR,Complete,647891,1740,MH0433_GL0005093,0.004673,unknown,unknown,0.003946,unknown,unknown,unknown,unknown
8,EUR,EUR,Lack 3'-end,7928769,261,O2.UC48-1_GL0076618,0.008411,unknown,unknown,0.014996,unknown,unknown,unknown,unknown
9,USA,USA,Lack 3'-end,8920818,180,764588959-stool1_revised_C775588_1_gene178391,0.000935,unknown,unknown,0.000789,unknown,unknown,unknown,unknown


In [35]:
# Sort by columns (axis 1), descending
df.sort_index(axis=1, ascending=False)

Unnamed: 0,eggNOG Functional Categories,eggNOG Annotation,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),Sample Occurence Frequency,KEGG Functional Categories,KEGG Annotation,Individual Occurence Frequency,Gene Name,Gene Length,Gene ID,Gene Completeness,Cohort Origin,Cohort Assembled
0,Cell wall/membrane/envelope biogenesis,COG4932,unknown,unknown,0.008682,unknown,unknown,0.009346,MH0396_GL0114156,549,5209933,Lack both ends,EUR,EUR
1,Energy production and conversion,COG3411,unknown,unknown,0.187056,Energy Metabolism,K00335,0.195327,MH0012_GL0174453,372,6811315,Complete,EUR,EUR;CHN;USA
2,unknown,unknown,unknown,unknown,0.015785,unknown,unknown,0.018692,MH0389_GL0170585,330,7221353,Lack 5'-end,EUR,EUR
3,Function unknown,NOG125034,unknown,unknown,0.008682,unknown,unknown,0.008411,DOM015_GL0050638,177,8950791,Lack 5'-end,CHN,CHN
4,unknown,unknown,Firmicutes,Coprobacillus,0.112865,unknown,unknown,0.113084,469596.HMPREF9488_01493,984,2441669,Complete,SP,EUR;CHN;USA
5,unknown,unknown,unknown,unknown,0.008682,unknown,unknown,0.009346,V1.FI14_GL0101784,594,4807882,Complete,EUR,EUR
6,Function unknown,NOG12793,unknown,unknown,0.004736,unknown,unknown,0.004673,MH0448_GL0045910,1695,693534,Lack both ends,EUR,EUR
7,unknown,unknown,unknown,unknown,0.003946,unknown,unknown,0.004673,MH0433_GL0005093,1740,647891,Complete,EUR,EUR;USA
8,unknown,unknown,unknown,unknown,0.014996,unknown,unknown,0.008411,O2.UC48-1_GL0076618,261,7928769,Lack 3'-end,EUR,EUR
9,unknown,unknown,unknown,unknown,0.000789,unknown,unknown,0.000935,764588959-stool1_revised_C775588_1_gene178391,180,8920818,Lack 3'-end,USA,USA
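On a table this wide it is hard to see what each axis does, so here is a sketch on a deliberately tiny frame with unordered labels. The labels `a`, `b` and the index values are invented for the example.

```python
import pandas as pd

# Row labels and column labels are both out of order on purpose.
toy = pd.DataFrame({"b": [1, 2], "a": [3, 4]}, index=[1, 0])

rows_sorted = toy.sort_index(axis=0)                      # reorders rows by index label
columns_sorted = toy.sort_index(axis=1)                   # reorders columns by name
columns_desc = toy.sort_index(axis=1, ascending=False)    # columns in reverse name order

print(rows_sorted.index.tolist())
print(columns_sorted.columns.tolist())
print(columns_desc.columns.tolist())
```

None of these calls changes `toy` itself, since `inplace` defaults to `False`; each returns a new, reordered frame.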


In [36]:
df.sort_values(by='Gene Length')

Unnamed: 0,Gene ID,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
88836,9865705,V1.CD6-0-PT_GL0096973,102,Lack 5'-end,EUR,unknown,unknown,unknown,unknown,0.005525,0.005607,unknown,unknown,EUR
476498,9862460,O2.UC4-2_GL0053162,102,Lack 5'-end,EUR,unknown,unknown,unknown,unknown,0.021310,0.020561,unknown,unknown,EUR
151345,9862491,O2.UC43-0_GL0033611,102,Lack 3'-end,EUR,unknown,unknown,unknown,unknown,0.015785,0.014953,unknown,unknown,EUR
34192,9877190,159551223-stool2_revised_C916401_1_gene84247,102,Lack 3'-end,USA,Firmicutes,Eubacterium,unknown,unknown,0.014207,0.012150,unknown,unknown,USA
331083,9860951,O2.UC21-1_GL0136539,102,Lack 3'-end,EUR,unknown,unknown,unknown,unknown,0.059195,0.061682,unknown,unknown,EUR
280181,9857596,MH0415_GL0056639,102,Lack 3'-end,EUR,unknown,unknown,unknown,unknown,0.020521,0.023364,unknown,unknown,EUR
483179,9878379,763577454-stool2_revised_scaffold23171_1_gene1...,102,Lack 5'-end,USA,Proteobacteria,unknown,unknown,unknown,0.169692,0.157944,unknown,unknown,USA
370156,9853996,MH0335_GL0009364,102,Complete,EUR,unknown,unknown,unknown,unknown,0.004736,0.005607,unknown,unknown,EUR
409047,9860917,O2.UC2-0_GL0121235,102,Lack 5'-end,EUR,unknown,unknown,unknown,unknown,0.011050,0.007477,unknown,unknown,EUR
34135,9872403,N003A_GL0007810,102,Lack 5'-end,CHN,unknown,unknown,unknown,unknown,0.004736,0.005607,unknown,unknown,CHN


In [37]:
help(df)

Help on DataFrame in module pandas.core.frame object:

class DataFrame(pandas.core.generic.NDFrame)
 |  Two-dimensional size-mutable, potentially heterogeneous tabular data
 |  structure with labeled axes (rows and columns). Arithmetic operations
 |  align on both row and column labels. Can be thought of as a dict-like
 |  container for Series objects. The primary pandas data structure
 |  
 |  Parameters
 |  ----------
 |  data : numpy ndarray (structured or homogeneous), dict, or DataFrame
 |      Dict can contain Series, arrays, constants, or list-like objects
 |  index : Index or array-like
 |      Index to use for resulting frame. Will default to np.arange(n) if
 |      no indexing information part of input data and no index provided
 |  columns : Index or array-like
 |      Column labels to use for resulting frame. Will default to
 |      np.arange(n) if no column labels are provided
 |  dtype : dtype, default None
 |      Data type to force, otherwise infer
 |  copy : boolean, d

In [38]:
df.sort_values(by='Gene Length', ascending=True)

Unnamed: 0,Gene ID,Gene Name,Gene Length,Gene Completeness,Cohort Origin,Taxonomic Annotation(Phylum Level),Taxonomic Annotation(Genus Level),KEGG Annotation,eggNOG Annotation,Sample Occurence Frequency,Individual Occurence Frequency,KEGG Functional Categories,eggNOG Functional Categories,Cohort Assembled
88836,9865705,V1.CD6-0-PT_GL0096973,102,Lack 5'-end,EUR,unknown,unknown,unknown,unknown,0.005525,0.005607,unknown,unknown,EUR
476498,9862460,O2.UC4-2_GL0053162,102,Lack 5'-end,EUR,unknown,unknown,unknown,unknown,0.021310,0.020561,unknown,unknown,EUR
151345,9862491,O2.UC43-0_GL0033611,102,Lack 3'-end,EUR,unknown,unknown,unknown,unknown,0.015785,0.014953,unknown,unknown,EUR
34192,9877190,159551223-stool2_revised_C916401_1_gene84247,102,Lack 3'-end,USA,Firmicutes,Eubacterium,unknown,unknown,0.014207,0.012150,unknown,unknown,USA
331083,9860951,O2.UC21-1_GL0136539,102,Lack 3'-end,EUR,unknown,unknown,unknown,unknown,0.059195,0.061682,unknown,unknown,EUR
280181,9857596,MH0415_GL0056639,102,Lack 3'-end,EUR,unknown,unknown,unknown,unknown,0.020521,0.023364,unknown,unknown,EUR
483179,9878379,763577454-stool2_revised_scaffold23171_1_gene1...,102,Lack 5'-end,USA,Proteobacteria,unknown,unknown,unknown,0.169692,0.157944,unknown,unknown,USA
370156,9853996,MH0335_GL0009364,102,Complete,EUR,unknown,unknown,unknown,unknown,0.004736,0.005607,unknown,unknown,EUR
409047,9860917,O2.UC2-0_GL0121235,102,Lack 5'-end,EUR,unknown,unknown,unknown,unknown,0.011050,0.007477,unknown,unknown,EUR
34135,9872403,N003A_GL0007810,102,Lack 5'-end,CHN,unknown,unknown,unknown,unknown,0.004736,0.005607,unknown,unknown,CHN
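`sort_values` also accepts `ascending=False` and a list of columns, which helps when many rows share the same value, as the genes of length 102 do above. The following is a sketch on toy data; the values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gene Name": ["gene_a", "gene_b", "gene_c"],
    "Gene Length": [450, 102, 450],
    "Gene ID": [3, 1, 2],
})

# Longest genes first.
longest_first = toy.sort_values(by="Gene Length", ascending=False)

# Ties in Gene Length are broken by Gene ID, in ascending order.
by_length_then_id = toy.sort_values(
    by=["Gene Length", "Gene ID"], ascending=[False, True]
)

print(by_length_then_id["Gene Name"].tolist())
```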


**Functions and attributes used today**

* `df.head()`
* `df.tail()`
* `df.dtypes`
* `df.shape`
* `df.info()`
* `df.describe()`
* `df.T`
* `df.index`
* `df.columns`
* `df.sort_index()`
* `df.sort_values()`
* `df.to_csv()`