# Pandas

| Function     | Description                                                                          |
|--------------|--------------------------------------------------------------------------------------|
| read_csv     | Loads delimited data from a file or URL; the comma (',') is the default delimiter   |
| read_table   | Loads delimited data from a file or URL; the tab ('\t') is the default delimiter    |
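
As a minimal sketch of the difference summarized in the table, the snippet below parses the same records once as comma-delimited and once as tab-delimited text. The in-memory strings `csv_text` and `tsv_text` are made up for illustration and stand in for a real file or URL.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for a file or URL.
csv_text = "id,title\n1,First paper\n2,Second paper\n"
tsv_text = "id\ttitle\n1\tFirst paper\n2\tSecond paper\n"

# read_csv splits on commas by default.
from_csv = pd.read_csv(io.StringIO(csv_text))

# read_table splits on tabs by default.
from_tsv = pd.read_table(io.StringIO(tsv_text))

# Passing sep explicitly lets read_csv parse tab-delimited data as well.
from_tsv_via_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(from_csv.equals(from_tsv))
print(from_tsv.equals(from_tsv_via_csv))
```

All three calls should produce the same two-row frame, so in practice the choice between the two functions mostly comes down to which default delimiter matches the data.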


## Creating a DataFrame

In [4]:
import pandas as pd

# read from a URL
url = "https://gist.githubusercontent.com/w-dan/b84bdbfbd86b610a89aa0aa57e6efb5e/raw/aeaffaa9d26c66678b697ecb9fd99fd0c8eaca52/papers.csv"
df = pd.read_csv(url)

# display the DataFrame
df

Unnamed: 0,id,title,abstract,Unnamed: 3,keywords,Unnamed: 5,Unnamed: 6,Unnamed: 7
0,1,Ensemble Statistical and Heuristic Models for ...,Statistical word alignment models need large a...,,"statistical word alignment, ensemble learning,...",,,
1,2,Improving Spectral Learning by Using Multiple ...,Spectral learning algorithms learn an unknown ...,,"representation, spectral learning, discrete fo...",,,
2,3,Applying Swarm Ensemble Clustering Technique f...,Number of defects remaining in a system provid...,,"software defect prediction, particle swarm opt...",,,
3,4,Reducing the Effects of Detrimental Instances,Not all instances in a data set are equally be...,,"filtering, label noise, instance weighting",,,
4,5,Concept Drift Awareness in Twitter Streams,Learning in non-stationary environments is not...,,"twitter, adaptation models, time-frequency ana...",,,
...,...,...,...,...,...,...,...,...
443,444,A Machine Learning Tool for Supporting Advance...,"In the current era of big data, high volumes o...",,"machine-learning,unsupervised-learning,knowled...",,,
444,445,Advanced ECHMM-Based Machine Learning Tools fo...,We present a novel approach for accurate chara...,,"workload characterization,hmm,cepstral coeffic...",,,
445,446,A Cluster Analysis of Challenging Behaviors in...,"We apply cluster analysis to a sample of 2,116...",,"cluster analysis,autism spectrum disorder,chal...",,,
446,447,Predicting Psychosis Using the Experience Samp...,Smart phones have become ubiquitous in the rec...,,"predicting psychosis,esm,mhealth,svm,gaussian ...",,,


It seems to contain many null values... Let's try to get rid of them to obtain a cleaner DataFrame:

# Example: cleaning a DataFrame

## The *drop* method

Parameters:
- `labels`: **array** (list-like) with the names of the columns (or rows) to drop
- `axis`: **integer** (1 = column, 0 = row)

Returns:
- A `DataFrame` without the specified columns (or rows)
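
The parameters above can be sketched on a small frame. The `toy` DataFrame and its column names are hypothetical and only mimic the empty `Unnamed` columns of the papers dataset.

```python
import pandas as pd

# Hypothetical frame with one useful column and two empty ones.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "Unnamed: 1": [None, None, None],
    "Unnamed: 2": [None, None, None],
})

# axis=1 drops columns; the original frame is left unchanged.
without_cols = toy.drop(["Unnamed: 1", "Unnamed: 2"], axis=1)

# The columns= keyword is an equivalent, more explicit spelling.
same_result = toy.drop(columns=["Unnamed: 1", "Unnamed: 2"])

# axis=0 drops rows by index label.
without_first_row = toy.drop([0], axis=0)

print(list(without_cols.columns))
print(list(without_first_row.index))
```

Because `drop` returns a new object, the result has to be assigned back (as the next cell does with `df = df.drop(...)`) for the change to persist.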

In [27]:
df = df.drop(['Unnamed: 3', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7'], axis=1)

df

Unnamed: 0,id,title,abstract,keywords,abstract_length
0,1,Ensemble Statistical and Heuristic Models for ...,Statistical word alignment models need large a...,"statistical word alignment, ensemble learning,...",110
1,2,Improving Spectral Learning by Using Multiple ...,Spectral learning algorithms learn an unknown ...,"representation, spectral learning, discrete fo...",100
2,3,Applying Swarm Ensemble Clustering Technique f...,Number of defects remaining in a system provid...,"software defect prediction, particle swarm opt...",111
3,4,Reducing the Effects of Detrimental Instances,Not all instances in a data set are equally be...,"filtering, label noise, instance weighting",193
4,5,Concept Drift Awareness in Twitter Streams,Learning in non-stationary environments is not...,"twitter, adaptation models, time-frequency ana...",174
...,...,...,...,...,...
443,444,A Machine Learning Tool for Supporting Advance...,"In the current era of big data, high volumes o...","machine-learning,unsupervised-learning,knowled...",200
444,445,Advanced ECHMM-Based Machine Learning Tools fo...,We present a novel approach for accurate chara...,"workload characterization,hmm,cepstral coeffic...",243
445,446,A Cluster Analysis of Challenging Behaviors in...,"We apply cluster analysis to a sample of 2,116...","cluster analysis,autism spectrum disorder,chal...",119
446,447,Predicting Psychosis Using the Experience Samp...,Smart phones have become ubiquitous in the rec...,"predicting psychosis,esm,mhealth,svm,gaussian ...",229


## The unique method

Method of the pandas `Series` class (each column of a DataFrame is a `Series`)

Returns:
- An array with the unique values of a column

In [7]:
df['id'].unique()

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104,
       105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117,
       118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130,
       131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
       144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156,
       157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169,
       170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 18

## The nunique method


Analogous to unique

Returns:
- an `int` giving the number of distinct values in the column

In [8]:
df['id'].nunique()

448

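When the column has no missing values, this is equivalent to (`nunique` ignores NaN by default, whereas `unique` keeps it):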

In [9]:
len(df['id'].unique())

448
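
One caveat to this equivalence is how missing values are counted. The sketch below uses a made-up Series `s` (not the `id` column) to show where the two counts can differ.

```python
import pandas as pd

# Hypothetical column with a repeated value and one missing value.
s = pd.Series([1.0, 2.0, 2.0, float("nan")])

distinct = s.unique()                 # NaN is kept as a value of its own
n_default = s.nunique()               # NaN is ignored by default
n_with_nan = s.nunique(dropna=False)  # NaN is counted as a distinct value

print(len(distinct), n_default, n_with_nan)  # 3 2 3
```

For a column without nulls, such as `id` here, both approaches give the same number.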

## The isnull method

In [30]:
df.isnull().sum()

id                 0
title              0
abstract           0
keywords           0
abstract_length    0
dtype: int64
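
The output above shows no nulls because the empty columns were already dropped. As a hedged sketch of how the same check looks on data that still has gaps, the hypothetical `toy` frame below has one partially empty and one fully empty column. Using `dropna(axis=1, how="all")` is an alternative shown here for illustration, not the approach the notebook takes.

```python
import pandas as pd

# Hypothetical frame: "empty" is entirely null, "partial" has one gap.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "partial": ["a", None, "c"],
    "empty": [None, None, None],
})

# isnull() gives a boolean frame; sum() counts the True values per column.
null_counts = toy.isnull().sum()
print(null_counts)

# Drop only the columns in which every value is null.
cleaned = toy.dropna(axis=1, how="all")
print(list(cleaned.columns))
```

This avoids listing the empty columns by name, at the cost of being less explicit about what is removed.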

## The apply method

In [29]:
df['abstract_length'] = df['abstract'].apply(lambda x: len(x.split()))

df

Unnamed: 0,id,title,abstract,keywords,abstract_length
0,1,Ensemble Statistical and Heuristic Models for ...,Statistical word alignment models need large a...,"statistical word alignment, ensemble learning,...",110
1,2,Improving Spectral Learning by Using Multiple ...,Spectral learning algorithms learn an unknown ...,"representation, spectral learning, discrete fo...",100
2,3,Applying Swarm Ensemble Clustering Technique f...,Number of defects remaining in a system provid...,"software defect prediction, particle swarm opt...",111
3,4,Reducing the Effects of Detrimental Instances,Not all instances in a data set are equally be...,"filtering, label noise, instance weighting",193
4,5,Concept Drift Awareness in Twitter Streams,Learning in non-stationary environments is not...,"twitter, adaptation models, time-frequency ana...",174
...,...,...,...,...,...
443,444,A Machine Learning Tool for Supporting Advance...,"In the current era of big data, high volumes o...","machine-learning,unsupervised-learning,knowled...",200
444,445,Advanced ECHMM-Based Machine Learning Tools fo...,We present a novel approach for accurate chara...,"workload characterization,hmm,cepstral coeffic...",243
445,446,A Cluster Analysis of Challenging Behaviors in...,"We apply cluster analysis to a sample of 2,116...","cluster analysis,autism spectrum disorder,chal...",119
446,447,Predicting Psychosis Using the Experience Samp...,Smart phones have become ubiquitous in the rec...,"predicting psychosis,esm,mhealth,svm,gaussian ...",229
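
The `apply` call in this section runs the lambda once per abstract and stores the word count. A minimal sketch on made-up strings (the `toy` frame is hypothetical, not the papers dataset) shows the same computation, along with an equivalent spelling that uses the `.str` accessor:

```python
import pandas as pd

# Hypothetical abstracts standing in for the 'abstract' column.
toy = pd.DataFrame({"abstract": ["one two three", "four five", "six"]})

# apply calls the function once per element of the Series.
via_apply = toy["abstract"].apply(lambda text: len(text.split()))

# The .str accessor expresses the same word count without a lambda.
via_str = toy["abstract"].str.split().str.len()

print(via_apply.tolist())  # [3, 2, 1]
print(via_str.tolist())    # [3, 2, 1]
```

Note that the lambda version assumes every value is a string; a missing value would raise an error because it has no `split` method.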


# Concatenation and grouping

In [15]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                    index=[4, 5, 6, 7])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                    index=[8, 9, 10, 11])

df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                    index=[2, 3, 6, 7])

result = pd.concat([df1, df2, df3])
result

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9


Suppose we want to associate a specific key with each of the pieces that make up the concatenated DataFrame. We can do this with the `keys` argument:

In [13]:
# concatenate, creating new keys
# equivalent to pd.concat({'x': df1, 'y': df2, 'z': df3})
result = pd.concat([df1, df2, df3], keys=['x', 'y', 'z'])
result

Unnamed: 0,Unnamed: 1,A,B,C,D
x,0,A0,B0,C0,D0
x,1,A1,B1,C1,D1
x,2,A2,B2,C2,D2
x,3,A3,B3,C3,D3
y,4,A4,B4,C4,D4
y,5,A5,B5,C5,D5
y,6,A6,B6,C6,D6
y,7,A7,B7,C7,D7
z,8,A8,B8,C8,D8
z,9,A9,B9,C9,D9
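
One reason to attach keys is that each original piece can be recovered afterwards. The sketch below assumes two small hypothetical frames, `left` and `right`, in place of `df1`, `df2` and `df3`.

```python
import pandas as pd

# Two small hypothetical frames, much like df1 and df2 above.
left = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
right = pd.DataFrame({"A": ["A2", "A3"]}, index=[2, 3])

# keys adds an outer index level naming the source of each row.
combined = pd.concat([left, right], keys=["x", "y"])

print(combined.index.nlevels)  # 2
print(combined.loc["y"])       # rows that came from `right`
```

Selecting with `.loc` on the outer level returns the rows of that piece with their original index.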


In [16]:
result = pd.concat([df1, df4], ignore_index=True, sort=False)
result

Unnamed: 0,A,B,C,D,F
0,A0,B0,C0,D0,
1,A1,B1,C1,D1,
2,A2,B2,C2,D2,
3,A3,B3,C3,D3,
4,,B2,,D2,F2
5,,B3,,D3,F3
6,,B6,,D6,F6
7,,B7,,D7,F7


In [17]:
result = pd.concat([df1, df4], axis=1, sort=False)
result

Unnamed: 0,A,B,C,D,B.1,D.1,F
0,A0,B0,C0,D0,,,
1,A1,B1,C1,D1,,,
2,A2,B2,C2,D2,B2,D2,F2
3,A3,B3,C3,D3,B3,D3,F3
6,,,,,B6,D6,F6
7,,,,,B7,D7,F7
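
Both results above keep every row and column and fill the gaps with missing values, which is the default `join="outer"` behavior of `concat`. As a hedged sketch, the hypothetical frames `left` and `right` below show what `join="inner"` keeps instead.

```python
import pandas as pd

# Hypothetical frames that share only the column "B" and the index label 1.
left = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
right = pd.DataFrame({"B": ["B1", "B2"], "F": ["F1", "F2"]}, index=[1, 2])

# Stacking rows: join="inner" keeps only the columns present in both frames.
rows_inner = pd.concat([left, right], ignore_index=True, join="inner")

# Side by side: join="inner" keeps only the index labels present in both frames.
cols_inner = pd.concat([left, right], axis=1, join="inner")

print(list(rows_inner.columns), len(rows_inner))
print(list(cols_inner.index), list(cols_inner.columns))
```

With an inner join no missing values are introduced, since only the shared labels survive.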


## Grouping

In [19]:
data = {'animal': ['cat', 'dog', 'bat', 'penguin'],
'num_legs': [4, 4, 2, 2],
'num_wings': [0, 0, 2, 2],
'class': ['mammal', 'mammal', 'mammal', 'bird'],
'locomotion': ['walks', 'walks', 'flies', 'walks']}
animals = pd.DataFrame(data)
animals

Unnamed: 0,animal,num_legs,num_wings,class,locomotion
0,cat,4,0,mammal,walks
1,dog,4,0,mammal,walks
2,bat,2,2,mammal,flies
3,penguin,2,2,bird,walks


In [22]:
# groups rows (axis=0) by default
grouped = animals.groupby('class')
grouped.groups

{'bird': [3], 'mammal': [0, 1, 2]}

In [25]:
for name, group in grouped:
  print('Group: ' + str(name))
  print("-" * 40)
  print(group, end="\n\n")

Group: bird
----------------------------------------
    animal  num_legs  num_wings class locomotion
3  penguin         2          2  bird      walks

Group: mammal
----------------------------------------
  animal  num_legs  num_wings   class locomotion
0    cat         4          0  mammal      walks
1    dog         4          0  mammal      walks
2    bat         2          2  mammal      flies
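
Iterating over the groups is mostly useful for inspection; grouping is more often followed by an aggregation. A minimal sketch, rebuilding a reduced version of the `animals` frame so that it runs on its own:

```python
import pandas as pd

# Same animals as above, reduced to the columns used here.
animals = pd.DataFrame({
    "animal": ["cat", "dog", "bat", "penguin"],
    "num_legs": [4, 4, 2, 2],
    "class": ["mammal", "mammal", "mammal", "bird"],
})

grouped = animals.groupby("class")

# size() counts the rows in each group.
group_sizes = grouped.size()
print(group_sizes)

# Selecting a column and aggregating gives one value per group.
mean_legs = grouped["num_legs"].mean()
print(mean_legs)

# get_group returns the rows of a single group.
birds = grouped.get_group("bird")
print(birds["animal"].tolist())  # ['penguin']
```

Each aggregation returns a Series indexed by the group key, here the values of `class`.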



