# Anaconda

The contents of this course are written in Python. We recommend installing Anaconda, a platform designed to make data science easier:

https://www.anaconda.com/

https://www.anaconda.com/download/

We recommend using the latest version of Python with Anaconda.

## Data structures

* __Tuples__ for grouping objects of different types

In [1]:
import numpy as np
tupla1 = ('abc', np.arange(0,10,0.5), 2.5)
tupla1

('abc',
 array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ,
         5.5,  6. ,  6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5]),
 2.5)

In [3]:
tupla1[2]

2.5
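As a small illustration of how tuples behave, the sketch below unpacks the same tuple into separate names and shows that a tuple cannot be modified once created. The names `texto`, `valores` and `numero` are only illustrative.

```python
import numpy as np

# Same tuple as in the cell above
tupla1 = ('abc', np.arange(0, 10, 0.5), 2.5)

# Unpacking assigns each element to its own name
texto, valores, numero = tupla1
print(type(texto).__name__, type(valores).__name__, type(numero).__name__)

# Tuples are immutable: assigning to an element raises TypeError
try:
    tupla1[2] = 3.0
except TypeError as error:
    print('TypeError:', error)
```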

* __Lists__ for grouping objects, usually of the same type

In [2]:
lista1 = ['abc', 'def', 'ghi']
lista2 = [1,2,3]
lista3 = [4,5,6]

In [3]:
lista2 + lista3

[1, 2, 3, 4, 5, 6]

* __Arrays__ are vectors and matrices for numerical manipulation

In [4]:
array1 = np.array(lista2)
array2 = np.array(lista3)
array1

array([1, 2, 3])

In [6]:
array1 + array2

array([5, 7, 9])
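The two cells above use the same `+` operator with different results. The following sketch puts both behaviours side by side; the variable names on the left-hand side are illustrative only.

```python
import numpy as np

lista2 = [1, 2, 3]
lista3 = [4, 5, 6]

# With lists, + concatenates and * repeats
concatenada = lista2 + lista3
repetida = lista2 * 2

# With arrays, + and * operate element by element
suma = np.array(lista2) + np.array(lista3)
doble = np.array(lista2) * 2

print(concatenada, repetida)
print(suma, doble)
```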

* __Dictionaries__ are collections of key-value pairs (they preserve insertion order since Python 3.7)

In [7]:
diccionario1 = dict(uno=1, dos=2, info='alguna info')
diccionario2 = {'uno':1, 'dos':2, 'info':'otra info'}
diccionario1['info']

'alguna info'

In [8]:
diccionario2.keys()

dict_keys(['uno', 'dos', 'info'])
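A few more dictionary operations that tend to be useful in practice are sketched below: listing keys, reading a key that may be missing, and adding a new entry. The key `'tres'` is a hypothetical example.

```python
diccionario2 = {'uno': 1, 'dos': 2, 'info': 'otra info'}

# Keys come back in insertion order (Python 3.7 and later)
claves = list(diccionario2)

# .get returns a default value instead of raising KeyError
faltante = diccionario2.get('tres', 0)

# Assigning to a new key adds an entry
diccionario2['tres'] = 3

for clave, valor in diccionario2.items():
    print(clave, valor)
```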

* __DataFrame__: data structures optimized for working with statistical data (`pandas` package)

In [4]:
import pandas as pd

# Columns are listed in the same order as in the output shown below
data = pd.DataFrame({
    'Genero': ['H', 'M', 'M', 'H', 'M', 'H', 'M', 'H', 'M', 'M'],
    'TiempoFacebook': [1.1, 0.5, 0.9, 0.8, 0.75, 0.35, 1.1, 1.1, 0.8, 0.9],
    'TiempoTwitter': [0.1, 0.5, 0.1, 0.2, 0.2, 0.3, 1.1, 1.21, 0.8, 0.9],
})

data

Unnamed: 0,Genero,TiempoFacebook,TiempoTwitter
0,H,1.1,0.1
1,M,0.5,0.5
2,M,0.9,0.1
3,H,0.8,0.2
4,M,0.75,0.2
5,H,0.35,0.3
6,M,1.1,1.1
7,H,1.1,1.21
8,M,0.8,0.8
9,M,0.9,0.9


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
Genero            10 non-null object
TiempoFacebook    10 non-null float64
TiempoTwitter     10 non-null float64
dtypes: float64(2), object(1)
memory usage: 320.0+ bytes


In [6]:
data.TiempoTwitter

0    0.10
1    0.50
2    0.10
3    0.20
4    0.20
5    0.30
6    1.10
7    1.21
8    0.80
9    0.90
Name: TiempoTwitter, dtype: float64

In [7]:
data['TiempoTwitter']

0    0.10
1    0.50
2    0.10
3    0.20
4    0.20
5    0.30
6    1.10
7    1.21
8    0.80
9    0.90
Name: TiempoTwitter, dtype: float64
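The two cells above return the same column. A minimal sketch of when the bracket form is needed: attribute access only works for column names that are valid Python identifiers, so a name with a space (the hypothetical `'Tiempo Total'` below) has to be accessed with brackets.

```python
import pandas as pd

data = pd.DataFrame({
    'Genero': ['H', 'M'],
    'TiempoTwitter': [0.1, 0.5],
})

# Attribute access and bracket access return the same Series
iguales = data.TiempoTwitter.equals(data['TiempoTwitter'])

# Hypothetical column whose name contains a space: brackets are required
data['Tiempo Total'] = data['TiempoTwitter'] * 2
total = data['Tiempo Total']

print(iguales)
print(total)
```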

#### Selecting rows and columns

In [9]:
data.head(3)

Unnamed: 0,Genero,TiempoFacebook,TiempoTwitter
0,H,1.1,0.1
1,M,0.5,0.5
2,M,0.9,0.1


In [10]:
data.tail()

Unnamed: 0,Genero,TiempoFacebook,TiempoTwitter
5,H,0.35,0.3
6,M,1.1,1.1
7,H,1.1,1.21
8,M,0.8,0.8
9,M,0.9,0.9


In [11]:
data.loc[0,:] # Selecting rows and columns by LABEL

Genero              H
TiempoFacebook    1.1
TiempoTwitter     0.1
Name: 0, dtype: object

In [12]:
data.loc[[0,1,2], :]

Unnamed: 0,Genero,TiempoFacebook,TiempoTwitter
0,H,1.1,0.1
1,M,0.5,0.5
2,M,0.9,0.1


In [13]:
data.loc[0:2, :] # Inclusive on both ends, 0 and 2

Unnamed: 0,Genero,TiempoFacebook,TiempoTwitter
0,H,1.1,0.1
1,M,0.5,0.5
2,M,0.9,0.1


In [14]:
data.loc[0:2, ['Genero', 'TiempoFacebook']] # Selecting columns (variables)

Unnamed: 0,Genero,TiempoFacebook
0,H,1.1
1,M,0.5
2,M,0.9


In [15]:
data.loc[0:2, 'Genero':'TiempoFacebook'] # Same result

Unnamed: 0,Genero,TiempoFacebook
0,H,1.1
1,M,0.5
2,M,0.9


In [16]:
data.head(3).drop('TiempoTwitter', axis=1) # Same result

Unnamed: 0,Genero,TiempoFacebook
0,H,1.1
1,M,0.5
2,M,0.9


In [17]:
data.loc[data.Genero == 'M', :] # Selection with a condition

Unnamed: 0,Genero,TiempoFacebook,TiempoTwitter
1,M,0.5,0.5
2,M,0.9,0.1
4,M,0.75,0.2
6,M,1.1,1.1
8,M,0.8,0.8
9,M,0.9,0.9


In [18]:
data.loc[data.Genero == 'M', 'TiempoFacebook'] # Selection with a condition

1    0.50
2    0.90
4    0.75
6    1.10
8    0.80
9    0.90
Name: TiempoFacebook, dtype: float64

In [19]:
data.iloc[:,0:2] # Selecting by POSITION
# NOT INCLUSIVE ON THE RIGHT

Unnamed: 0,Genero,TiempoFacebook
0,H,1.1
1,M,0.5
2,M,0.9
3,H,0.8
4,M,0.75
5,H,0.35
6,M,1.1
7,H,1.1
8,M,0.8
9,M,0.9
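The difference between the two slicing conventions is easy to get wrong, so here is a short side-by-side sketch. It rebuilds the same `data` frame so that it runs on its own; `desplazado` is an illustrative name for a subset whose labels no longer start at 0.

```python
import pandas as pd

data = pd.DataFrame({
    'Genero': ['H', 'M', 'M', 'H', 'M', 'H', 'M', 'H', 'M', 'M'],
    'TiempoFacebook': [1.1, 0.5, 0.9, 0.8, 0.75, 0.35, 1.1, 1.1, 0.8, 0.9],
    'TiempoTwitter': [0.1, 0.5, 0.1, 0.2, 0.2, 0.3, 1.1, 1.21, 0.8, 0.9],
})

# loc slices by label and includes both ends: rows 0, 1 and 2
por_etiqueta = data.loc[0:2, :]

# iloc slices by position and excludes the right end: rows 0 and 1
por_posicion = data.iloc[0:2, :]

# After subsetting, labels and positions no longer coincide
desplazado = data.iloc[5:]               # labels 5..9, positions 0..4
mismas_filas = desplazado.loc[5:6].equals(desplazado.iloc[0:2])

print(len(por_etiqueta), len(por_posicion), mismas_filas)
```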


In [20]:
# NOT RECOMMENDED. YOU HAVE TO REMEMBER AN ORDER: FIRST THE LIST OF COLUMNS, THEN THE ROW SLICE
data[['Genero', 'TiempoTwitter']][4:10]

Unnamed: 0,Genero,TiempoTwitter
4,M,0.2
5,H,0.3
6,M,1.1
7,H,1.21
8,M,0.8
9,M,0.9
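As a possible alternative to the chained selection above, the same rows and columns can be requested in a single `loc` or `iloc` call. This is a sketch assuming the column order Genero, TiempoFacebook, TiempoTwitter shown in the outputs.

```python
import pandas as pd

data = pd.DataFrame({
    'Genero': ['H', 'M', 'M', 'H', 'M', 'H', 'M', 'H', 'M', 'M'],
    'TiempoFacebook': [1.1, 0.5, 0.9, 0.8, 0.75, 0.35, 1.1, 1.1, 0.8, 0.9],
    'TiempoTwitter': [0.1, 0.5, 0.1, 0.2, 0.2, 0.3, 1.1, 1.21, 0.8, 0.9],
})

# Chained form from the cell above: columns first, then a positional row slice
encadenado = data[['Genero', 'TiempoTwitter']][4:10]

# Single call by label (the right end, 9, is included)
con_loc = data.loc[4:9, ['Genero', 'TiempoTwitter']]

# Single call by position (the right end, 10, is excluded)
con_iloc = data.iloc[4:10, [0, 2]]

print(encadenado.equals(con_loc), encadenado.equals(con_iloc))
```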


In [21]:
data.set_index('Genero')

Unnamed: 0_level_0,TiempoFacebook,TiempoTwitter
Genero,Unnamed: 1_level_1,Unnamed: 2_level_1
H,1.1,0.1
M,0.5,0.5
M,0.9,0.1
H,0.8,0.2
M,0.75,0.2
H,0.35,0.3
M,1.1,1.1
H,1.1,1.21
M,0.8,0.8
M,0.9,0.9


In [22]:
data

Unnamed: 0,Genero,TiempoFacebook,TiempoTwitter
0,H,1.1,0.1
1,M,0.5,0.5
2,M,0.9,0.1
3,H,0.8,0.2
4,M,0.75,0.2
5,H,0.35,0.3
6,M,1.1,1.1
7,H,1.1,1.21
8,M,0.8,0.8
9,M,0.9,0.9


In [23]:
data.set_index('Genero', inplace=True)

In [24]:
data

Unnamed: 0_level_0,TiempoFacebook,TiempoTwitter
Genero,Unnamed: 1_level_1,Unnamed: 2_level_1
H,1.1,0.1
M,0.5,0.5
M,0.9,0.1
H,0.8,0.2
M,0.75,0.2
H,0.35,0.3
M,1.1,1.1
H,1.1,1.21
M,0.8,0.8
M,0.9,0.9
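The cells above show that `set_index` only changes `data` when `inplace=True` is passed. A minimal sketch of that behaviour on a two-row frame (the names `indexado`, `copia` and `devuelto` are illustrative):

```python
import pandas as pd

data = pd.DataFrame({
    'Genero': ['H', 'M'],
    'TiempoFacebook': [1.1, 0.5],
    'TiempoTwitter': [0.1, 0.5],
})

# Without inplace, set_index returns a new DataFrame and leaves data untouched
indexado = data.set_index('Genero')
print('Genero' in data.columns)      # data still has the column

# With inplace=True the object itself is modified and the call returns None
copia = data.copy()
devuelto = copia.set_index('Genero', inplace=True)
print(devuelto, copia.index.name)
```

Reassigning the result (`data = data.set_index('Genero')`) is an equivalent way to keep the change.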


In [25]:
data.loc['H', data.columns[0]] # The .ix indexer was deprecated and then removed in pandas 1.0
# WHEN NEITHER loc NOR iloc ALONE COVERS YOUR NEEDS, COMBINE THEM: HERE, ROW LABEL 'H' AND THE FIRST COLUMN BY POSITION


Genero
H    1.10
H    0.80
H    0.35
H    1.10
Name: TiempoFacebook, dtype: float64

In [26]:
data.reset_index(inplace=True)
#data.reset_index(drop=True, inplace=True)
data

Unnamed: 0,Genero,TiempoFacebook,TiempoTwitter
0,H,1.1,0.1
1,M,0.5,0.5
2,M,0.9,0.1
3,H,0.8,0.2
4,M,0.75,0.2
5,H,0.35,0.3
6,M,1.1,1.1
7,H,1.1,1.21
8,M,0.8,0.8
9,M,0.9,0.9


#### Multiple selection using filters

In [38]:
data[data.TiempoFacebook > 1.0] # A single condition

Unnamed: 0,Genero,TiempoFacebook,TiempoTwitter
0,H,1.1,0.1
6,M,1.1,1.1
7,H,1.1,1.21


In [39]:
data[(data.TiempoFacebook > 1.0) & (data.Genero == 'H')] # Multiple criteria (AND)

Unnamed: 0,Genero,TiempoFacebook,TiempoTwitter
0,H,1.1,0.1
7,H,1.1,1.21


In [40]:
data[(data.TiempoFacebook > 1.0) | (data.Genero == 'H')] # Multiple criteria (OR)

Unnamed: 0,Genero,TiempoFacebook,TiempoTwitter
0,H,1.1,0.1
3,H,0.8,0.2
5,H,0.35,0.3
6,M,1.1,1.1
7,H,1.1,1.21
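The parentheses around each condition are needed because `&` and `|` bind more tightly than comparison operators in Python. As one possible alternative, `DataFrame.query` accepts the condition as a string with `and`/`or`; the sketch below checks that both forms select the same rows.

```python
import pandas as pd

data = pd.DataFrame({
    'Genero': ['H', 'M', 'M', 'H', 'M', 'H', 'M', 'H', 'M', 'M'],
    'TiempoFacebook': [1.1, 0.5, 0.9, 0.8, 0.75, 0.35, 1.1, 1.1, 0.8, 0.9],
    'TiempoTwitter': [0.1, 0.5, 0.1, 0.2, 0.2, 0.3, 1.1, 1.21, 0.8, 0.9],
})

# Boolean mask: each condition must be wrapped in parentheses
con_mascara = data[(data.TiempoFacebook > 1.0) & (data.Genero == 'H')]

# Same filter written as a query string
con_query = data.query("TiempoFacebook > 1.0 and Genero == 'H'")

print(con_mascara.equals(con_query))
```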


In [41]:
# Multiple criteria on the SAME VARIABLE
data[(data.Genero == 'M') | (data.Genero == 'H')]

Unnamed: 0,Genero,TiempoFacebook,TiempoTwitter
0,H,1.1,0.1
1,M,0.5,0.5
2,M,0.9,0.1
3,H,0.8,0.2
4,M,0.75,0.2
5,H,0.35,0.3
6,M,1.1,1.1
7,H,1.1,1.21
8,M,0.8,0.8
9,M,0.9,0.9


In [42]:
(data.Genero == 'M') | (data.Genero == 'H')

0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
8    True
9    True
Name: Genero, dtype: bool

In [43]:
data.Genero.isin(['H', 'M'])

0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
8    True
9    True
Name: Genero, dtype: bool

In [44]:
data[data.Genero.isin(['H', 'M'])]

Unnamed: 0,Genero,TiempoFacebook,TiempoTwitter
0,H,1.1,0.1
1,M,0.5,0.5
2,M,0.9,0.1
3,H,0.8,0.2
4,M,0.75,0.2
5,H,0.35,0.3
6,M,1.1,1.1
7,H,1.1,1.21
8,M,0.8,0.8
9,M,0.9,0.9
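Since every row is either 'H' or 'M', the filters above keep the whole table. The sketch below confirms that `isin` matches the explicit `|` form and shows how `~` negates a mask to exclude values; `solo_m` is an illustrative name.

```python
import pandas as pd

data = pd.DataFrame({
    'Genero': ['H', 'M', 'M', 'H', 'M', 'H', 'M', 'H', 'M', 'M'],
    'TiempoFacebook': [1.1, 0.5, 0.9, 0.8, 0.75, 0.35, 1.1, 1.1, 0.8, 0.9],
})

# isin is equivalent to chaining == comparisons with |
con_or = data[(data.Genero == 'M') | (data.Genero == 'H')]
con_isin = data[data.Genero.isin(['H', 'M'])]

# ~ negates the boolean mask: keep rows whose Genero is NOT in the list
solo_m = data[~data.Genero.isin(['H'])]

print(con_or.equals(con_isin), len(solo_m))
```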


In [45]:
data.to_csv('datasets/datosRedesSociales.csv') # Save the data
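By default `to_csv` also writes the row index, which comes back as a column named `Unnamed: 0` when the file is read again. The sketch below shows two ways of handling this, using a temporary folder instead of the `datasets/` directory so that it runs anywhere.

```python
import os
import tempfile

import pandas as pd

data = pd.DataFrame({
    'Genero': ['H', 'M'],
    'TiempoFacebook': [1.1, 0.5],
})

with tempfile.TemporaryDirectory() as carpeta:
    ruta = os.path.join(carpeta, 'datosRedesSociales.csv')

    # Default: the index is written as an unnamed first column
    data.to_csv(ruta)
    con_indice = pd.read_csv(ruta)

    # Option 1: tell read_csv that the first column is the index
    recuperado = pd.read_csv(ruta, index_col=0)

    # Option 2: do not write the index at all
    data.to_csv(ruta, index=False)
    sin_indice = pd.read_csv(ruta)

print(list(con_indice.columns))
print(list(recuperado.columns), list(sin_indice.columns))
```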

# Loading external data

### From R

First we save the data in CSV format:

    write.table(data, file = "data.csv",row.names=FALSE, na="", col.names=TRUE, sep=",")

then, in Python:

    import pandas
    data = pandas.read_csv('data.csv')

In [46]:
import statsmodels.api as sm
duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")
print(duncan_prestige.__doc__)


+----------+-------------------+
| Duncan   | R Documentation   |
+----------+-------------------+

Duncan's Occupational Prestige Data
-----------------------------------

Description
~~~~~~~~~~~

The ``Duncan`` data frame has 45 rows and 4 columns. Data on the
prestige and other characteristics of 45 U. S. occupations in 1950.

Usage
~~~~~

::

    Duncan

Format
~~~~~~

This data frame contains the following columns:

type
    Type of occupation. A factor with the following levels: ``prof``,
    professional and managerial; ``wc``, white-collar; ``bc``,
    blue-collar.

income
    Percent of males in occupation earning $3500 or more in 1950.

education
    Percent of males in occupation in 1950 who were high-school
    graduates.

prestige
    Percent of raters in NORC study rating occupation as excellent or
    good in prestige.

Source
~~~~~~

Duncan, O. D. (1961) A socioeconomic index for all occupations. In
Reiss, A. J., Jr. (Ed.) *Occupations and Social Status.* Free Press
[Ta

In [47]:
duncan_prestige.data.head(5)

Unnamed: 0,type,income,education,prestige
accountant,prof,62,86,82
pilot,prof,72,76,83
architect,prof,75,92,90
author,prof,55,90,76
chemist,prof,64,86,90


In [49]:
ufo = pd.read_csv('https://raw.githubusercontent.com/AngelBerihuete/introstats/master/datasets/ufo.csv')
ufo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18241 entries, 0 to 18240
Data columns (total 5 columns):
City               18216 non-null object
Colors Reported    2882 non-null object
Shape Reported     15597 non-null object
State              18241 non-null object
Time               18241 non-null object
dtypes: object(5)
memory usage: 712.6+ KB
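The `info()` output above reports non-null counts, so the number of missing values has to be worked out by subtraction. A small sketch of getting those counts directly with `isna()`; the three-row frame below is a made-up miniature with some of the same column names, not the real UFO data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, only to illustrate counting missing values
ejemplo = pd.DataFrame({
    'City': ['Ithaca', None, 'Holyoke'],
    'Colors Reported': [np.nan, np.nan, 'RED'],
    'State': ['NY', 'NJ', 'CO'],
})

faltantes = ejemplo.isna().sum()     # missing values per column
completos = ejemplo.notna().sum()    # non-null values per column, as in info()

print(faltantes)
print(completos)
```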


# [Further material]

### Do I need to know anything else to handle data? Selecting data with *numpy*

The elements of these data structures can be accessed very easily using index and slice notation:

In [50]:
a = np.arange(1.0, 10.0, 0.5)
a[:]


array([ 1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ,  5.5,  6. ,
        6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5])

In [51]:
a[2:5]

array([ 2. ,  2.5,  3. ])

In [52]:
a[2:]

array([ 2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ,  5.5,  6. ,  6.5,  7. ,
        7.5,  8. ,  8.5,  9. ,  9.5])

In [53]:
a[:2]

array([ 1. ,  1.5])

> Important: remember that __indexing starts at 0__

Negative numbers can also be used to index from the end of the structure:

In [54]:
a[-1]

9.5

In [55]:
a[-2:]

array([ 9. ,  9.5])
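Slices also accept a third value, the step. The following sketch combines negative indices and steps on the same array `a`; the names on the left-hand side are illustrative.

```python
import numpy as np

a = np.arange(1.0, 10.0, 0.5)    # 18 values from 1.0 to 9.5

ultimo = a[-1]          # last element
sin_ultimos = a[:-2]    # everything except the last two elements
cada_dos = a[::2]       # every second element, starting at position 0
invertido = a[::-1]     # the whole array in reverse order

print(ultimo, len(sin_ultimos))
print(cada_dos)
print(invertido[:3])
```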

## A bit more on vectors and arrays

In [56]:
np.zeros(3) # Vector

array([ 0.,  0.,  0.])

In [57]:
np.zeros( (2,3) ) # Matrix with 2 rows and 3 columns, displayed as a list of lists

array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [58]:
np.ones(4) # Vector

array([ 1.,  1.,  1.,  1.])

In [59]:
np.arange(1,3,0.5) # Vector that excludes 3, which avoids overlap between consecutive vectors

array([ 1. ,  1.5,  2. ,  2.5])

In [60]:
np.arange(3,5,0.5)

array([ 3. ,  3.5,  4. ,  4.5])

In [61]:
np.linspace(0,10,6)

array([  0.,   2.,   4.,   6.,   8.,  10.])
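`np.arange` and `np.linspace` differ in what their third argument means and in whether the end point is included. A short comparison sketch:

```python
import numpy as np

# arange: the third argument is the STEP, and the stop value is excluded
con_arange = np.arange(0, 10, 2)

# linspace: the third argument is the NUMBER OF POINTS, and the stop value is included
con_linspace = np.linspace(0, 10, 6)

# linspace can also exclude the stop value
sin_final = np.linspace(0, 10, 5, endpoint=False)

print(con_arange)
print(con_linspace)
print(sin_final)
```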

In [62]:
np.array([[2,3], [4,5]])

array([[2, 3],
       [4, 5]])

> __Important__: a vector (1-D array) is not the same as a matrix with a single row or column (2-D array)! Transposing a 1-D array has no effect, whereas transposing a 2-D array swaps its rows and columns.

In [63]:
x = np.arange(3)
x.T == x

array([ True,  True,  True], dtype=bool)

In [64]:
A = np.array([[2,3], [4,5]])
A.T == A

array([[ True, False],
       [False,  True]], dtype=bool)
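Looking at the `shape` attribute makes the difference explicit. In this sketch, `fila` and `columna` are illustrative names for 2-D versions of the same vector.

```python
import numpy as np

x = np.arange(3)

# A 1-D array has a single axis, so .T returns the same shape
print(x.shape, x.T.shape)

# Reshaping to 2-D gives a row matrix that can be transposed into a column
fila = x.reshape(1, 3)
columna = fila.T
print(fila.shape, columna.shape)

# np.newaxis is another way to build the column directly from the vector
columna2 = x[:, np.newaxis]
print(np.array_equal(columna, columna2))
```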