# Tutorial: Data Handling with Pandas

## Data Structures and Indexes


Pandas can read a wide range of formats ([more info](http://pandas.pydata.org/pandas-docs/stable/io.html)):

- read_csv
- read_excel
- read_hdf
- read_sql
- read_json
- read_msgpack (experimental; removed in pandas 1.0)
- read_html
- read_gbq (experimental)
- read_stata
- read_sas
- read_clipboard
- read_pickle

We'll start by trying things out on a dataset published for a Kaggle competition: [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/data).

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns  # seaborn.apionly was removed in seaborn 0.9
import matplotlib.pyplot as plt

In [2]:
# notebook setup

%matplotlib inline
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:,.2f}'.format
plt.rcParams['figure.figsize'] = (16, 12)

In [3]:
data = pd.read_csv("../data/titanic.csv")
data

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.00,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.00,1,0,PC 17599,71.28,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.00,0,0,STON/O2. 3101282,7.92,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.00,1,0,113803,53.10,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.00,0,0,373450,8.05,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.00,0,0,211536,13.00,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.00,0,0,112053,30.00,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.00,0,0,111369,30.00,C148,C


In [4]:
data.index

RangeIndex(start=0, stop=891, step=1)

In [5]:
data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [6]:
data.set_index("PassengerId").index

Int64Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
            ...
            882, 883, 884, 885, 886, 887, 888, 889, 890, 891],
           dtype='int64', name='PassengerId', length=891)

Pandas data structures are generally not modified in place by methods such as `set_index`. To keep the change, either pass `inplace=True` or reassign the variable.

In [7]:
# this works
data.set_index("PassengerId", inplace=True)
# or this
# data = data.set_index("PassengerId")
data

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.00,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.00,1,0,PC 17599,71.28,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.00,0,0,STON/O2. 3101282,7.92,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.00,1,0,113803,53.10,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.00,0,0,373450,8.05,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.00,0,0,211536,13.00,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.00,0,0,112053,30.00,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.00,0,0,111369,30.00,C148,C
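The difference between calling `set_index` and reassigning its result can be checked on a small frame. This is a minimal sketch with made-up values standing in for the Titanic data; the names `df` and `indexed` are illustrative only.

```python
import pandas as pd

# Toy frame standing in for the Titanic data (hypothetical values)
df = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})

# set_index returns a new DataFrame and leaves `df` untouched
indexed = df.set_index("PassengerId")

print(list(df.columns))       # PassengerId is still a regular column here
print(list(indexed.columns))  # in the new frame it has moved to the index
print(indexed.index.name)
```

Reassigning (`df = df.set_index(...)`) is usually the easier pattern to chain with other calls.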


## Types of Indexing

There are several ways to select a subset of the data:

- Like lists or arrays, by position.
- Like dictionaries, by key or label.
- Like arrays, by boolean (true/false) masks.
- You can index with a single value, a range (slice), or a list (array).
- All of these methods work on rows as well as on columns.


## Basic Rules

1. Use square brackets (shorthand for the `__getitem__` method) to select columns of a `DataFrame`

    ```python
    >>> df[['a', 'b', 'c']]
    ```

2. Use `.iloc` to index by position (both rows and columns)

    ```python
    >>> df.iloc[[1, 3], [0, 2]]
    ```
    
3. Use `.loc` to index by label (both rows and columns)

    ```python
    >>> df.loc[["elemento1", "elemento2", "elemento3"], ["columna1", "columna2"]]
    ```

4. `.ix` allowed mixing labels and positions (both rows and columns); it was deprecated in pandas 0.20 and removed in 1.0, so the calls below only run on older versions

    ```python
    >>> df.ix[["elemento1", "elemento2", "elemento3"], [0, 2]]
    ```
    ```python
    >>> df.ix[[1, 3], ["columna1", "columna2"]]
    ```
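The rules above can be tried on a toy `DataFrame`. This is a sketch with hypothetical row labels (`r1`, `r2`, `r3`) and column names; the last line shows one possible way to mix labels and positions without `.ix`, by translating positions into labels first.

```python
import pandas as pd

# Hypothetical toy data, not part of the Titanic dataset
df = pd.DataFrame(
    {"a": [10, 20, 30], "b": [1.5, 2.5, 3.5], "c": ["x", "y", "z"]},
    index=["r1", "r2", "r3"],
)

cols = df[["a", "c"]]                          # rule 1: columns by name
by_pos = df.iloc[[0, 2], [0, 2]]               # rule 2: rows and columns by position
by_label = df.loc[["r1", "r3"], ["a", "c"]]    # rule 3: rows and columns by label

# Mixing without .ix: row labels plus column positions turned into labels
mixed = df.loc[["r1", "r3"], df.columns[[0, 2]]]

print(by_pos.equals(by_label), mixed.equals(by_label))
```

All three selections pick the same cells here, which is a convenient way to confirm that positions and labels line up as expected.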


In [8]:
data.loc[[1, 2, 3], ["Name", "Sex"]]

PassengerId,Name,Sex
1,"Braund, Mr. Owen Harris",male
2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female
3,"Heikkinen, Miss. Laina",female


In [9]:
data.iloc[[1, 2, 3], [2, 3]]

PassengerId,Name,Sex
2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female
3,"Heikkinen, Miss. Laina",female
4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female


In [10]:
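# `.ix` was removed in pandas 1.0; with an integer index it was label-based, like data.loc[[1, 2, 3], ["Name", "Sex"]]
data.ix[[1, 2, 3], ["Name", "Sex"]]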

PassengerId,Name,Sex
1,"Braund, Mr. Owen Harris",male
2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female
3,"Heikkinen, Miss. Laina",female


In [11]:
temp = data.copy()
temp.index = ["elemento_" + str(i) for i in temp.index]
temp

,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
elemento_1,0,3,"Braund, Mr. Owen Harris",male,22.00,1,0,A/5 21171,7.25,,S
elemento_2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.00,1,0,PC 17599,71.28,C85,C
elemento_3,1,3,"Heikkinen, Miss. Laina",female,26.00,0,0,STON/O2. 3101282,7.92,,S
elemento_4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.00,1,0,113803,53.10,C123,S
elemento_5,0,3,"Allen, Mr. William Henry",male,35.00,0,0,373450,8.05,,S
...,...,...,...,...,...,...,...,...,...,...,...
elemento_887,0,2,"Montvila, Rev. Juozas",male,27.00,0,0,211536,13.00,,S
elemento_888,1,1,"Graham, Miss. Margaret Edith",female,19.00,0,0,112053,30.00,B42,S
elemento_889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
elemento_890,1,1,"Behr, Mr. Karl Howell",male,26.00,0,0,111369,30.00,C148,C


In [12]:
temp.loc[["elemento_1", "elemento_2", "elemento_3"], ["Name", "Sex"]]

,Name,Sex
elemento_1,"Braund, Mr. Owen Harris",male
elemento_2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female
elemento_3,"Heikkinen, Miss. Laina",female


In [13]:
temp.iloc[[1, 2, 3], [2, 3]]

,Name,Sex
elemento_2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female
elemento_3,"Heikkinen, Miss. Laina",female
elemento_4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female


In [14]:
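# `.ix` was removed in pandas 1.0; with string labels, integers fell back to positions, like temp.iloc[[1, 2, 3]][["Name", "Sex"]]
temp.ix[[1, 2, 3], ["Name", "Sex"]]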

,Name,Sex
elemento_2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female
elemento_3,"Heikkinen, Miss. Laina",female
elemento_4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female


In [15]:
del temp

In [16]:
# indexing with slices

data.iloc[:3]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.28,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.92,,S


In [17]:
data.iloc[-3:]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [18]:
data.loc[1:10, "Name":"Ticket"]

PassengerId,Name,Sex,Age,SibSp,Parch,Ticket
1,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171
2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599
3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282
4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803
5,"Allen, Mr. William Henry",male,35.0,0,0,373450
6,"Moran, Mr. James",male,,0,0,330877
7,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463
8,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909
9,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742
10,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736
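Note that the `.loc` slice above returned ten rows (labels 1 through 10) and included the `Ticket` column: label slices include both endpoints, whereas positional slices with `.iloc` exclude the stop position, as in plain Python. A small sketch with a hypothetical `Series` makes the contrast visible:

```python
import pandas as pd

# Hypothetical Series whose labels start at 1, like PassengerId
s = pd.Series(["a", "b", "c", "d", "e"], index=[1, 2, 3, 4, 5])

by_label = s.loc[1:3]      # labels 1, 2 and 3: both endpoints included
by_position = s.iloc[1:3]  # positions 1 and 2: stop position excluded

print(by_label.tolist())
print(by_position.tolist())
```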


In [19]:
data[["Name", "Ticket"]]

PassengerId,Name,Ticket
1,"Braund, Mr. Owen Harris",A/5 21171
2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",PC 17599
3,"Heikkinen, Miss. Laina",STON/O2. 3101282
4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",113803
5,"Allen, Mr. William Henry",373450
...,...,...
887,"Montvila, Rev. Juozas",211536
888,"Graham, Miss. Margaret Edith",112053
889,"Johnston, Miss. Catherine Helen ""Carrie""",W./C. 6607
890,"Behr, Mr. Karl Howell",111369


In [20]:
use_cols = ["Name", "Ticket"]
data[use_cols]

PassengerId,Name,Ticket
1,"Braund, Mr. Owen Harris",A/5 21171
2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",PC 17599
3,"Heikkinen, Miss. Laina",STON/O2. 3101282
4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",113803
5,"Allen, Mr. William Henry",373450
...,...,...
887,"Montvila, Rev. Juozas",211536
888,"Graham, Miss. Margaret Edith",112053
889,"Johnston, Miss. Catherine Helen ""Carrie""",W./C. 6607
890,"Behr, Mr. Karl Howell",111369


In [21]:
data["Name"]

PassengerId
1                                Braund, Mr. Owen Harris
2      Cumings, Mrs. John Bradley (Florence Briggs Th...
3                                 Heikkinen, Miss. Laina
4           Futrelle, Mrs. Jacques Heath (Lily May Peel)
5                               Allen, Mr. William Henry
                             ...                        
887                                Montvila, Rev. Juozas
888                         Graham, Miss. Margaret Edith
889             Johnston, Miss. Catherine Helen "Carrie"
890                                Behr, Mr. Karl Howell
891                                  Dooley, Mr. Patrick
Name: Name, dtype: object

In [22]:
data[["Name"]]

PassengerId,Name
1,"Braund, Mr. Owen Harris"
2,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
3,"Heikkinen, Miss. Laina"
4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
5,"Allen, Mr. William Henry"
...,...
887,"Montvila, Rev. Juozas"
888,"Graham, Miss. Margaret Edith"
889,"Johnston, Miss. Catherine Helen ""Carrie"""
890,"Behr, Mr. Karl Howell"


In [23]:
data.Name

PassengerId
1                                Braund, Mr. Owen Harris
2      Cumings, Mrs. John Bradley (Florence Briggs Th...
3                                 Heikkinen, Miss. Laina
4           Futrelle, Mrs. Jacques Heath (Lily May Peel)
5                               Allen, Mr. William Henry
                             ...                        
887                                Montvila, Rev. Juozas
888                         Graham, Miss. Margaret Edith
889             Johnston, Miss. Catherine Helen "Carrie"
890                                Behr, Mr. Karl Howell
891                                  Dooley, Mr. Patrick
Name: Name, dtype: object

In [24]:
temp = data[["Name"]].copy()
# pitfall: attribute assignment does not create a column, it only sets a plain Python attribute
temp.OtroNombre = ["OTRO_" + n for n in data.Name]
temp

PassengerId,Name
1,"Braund, Mr. Owen Harris"
2,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
3,"Heikkinen, Miss. Laina"
4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
5,"Allen, Mr. William Henry"
...,...
887,"Montvila, Rev. Juozas"
888,"Graham, Miss. Margaret Edith"
889,"Johnston, Miss. Catherine Helen ""Carrie"""
890,"Behr, Mr. Karl Howell"


In [25]:
temp.OtroNombre[:10]

['OTRO_Braund, Mr. Owen Harris',
 'OTRO_Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
 'OTRO_Heikkinen, Miss. Laina',
 'OTRO_Futrelle, Mrs. Jacques Heath (Lily May Peel)',
 'OTRO_Allen, Mr. William Henry',
 'OTRO_Moran, Mr. James',
 'OTRO_McCarthy, Mr. Timothy J',
 'OTRO_Palsson, Master. Gosta Leonard',
 'OTRO_Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)',
 'OTRO_Nasser, Mrs. Nicholas (Adele Achem)']

In [26]:
# item assignment (square brackets) does create the column
temp["OtroNombre"] = ["OTRO_" + n for n in data.Name]
temp

PassengerId,Name,OtroNombre
1,"Braund, Mr. Owen Harris","OTRO_Braund, Mr. Owen Harris"
2,"Cumings, Mrs. John Bradley (Florence Briggs Th...","OTRO_Cumings, Mrs. John Bradley (Florence Brig..."
3,"Heikkinen, Miss. Laina","OTRO_Heikkinen, Miss. Laina"
4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)","OTRO_Futrelle, Mrs. Jacques Heath (Lily May Peel)"
5,"Allen, Mr. William Henry","OTRO_Allen, Mr. William Henry"
...,...,...
887,"Montvila, Rev. Juozas","OTRO_Montvila, Rev. Juozas"
888,"Graham, Miss. Margaret Edith","OTRO_Graham, Miss. Margaret Edith"
889,"Johnston, Miss. Catherine Helen ""Carrie""","OTRO_Johnston, Miss. Catherine Helen ""Carrie"""
890,"Behr, Mr. Karl Howell","OTRO_Behr, Mr. Karl Howell"
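The cells above show that dot notation can read an existing column but cannot create a new one. The sketch below reproduces that on a two-row frame; the names and values are hypothetical, and the `warnings` guard is only there because pandas may emit a warning for this pattern.

```python
import warnings
import pandas as pd

df = pd.DataFrame({"Name": ["Ana", "Luis"]})  # hypothetical data

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas may warn about this assignment
    # Sets a plain Python attribute on the object; no column is added
    df.OtherName = ["OTHER_Ana", "OTHER_Luis"]

print("OtherName" in df.columns)

# Item assignment adds a real column, aligned with the index
df["OtherName2"] = "OTHER_" + df["Name"]
print(list(df.columns))
```

For that reason it is usually safer to reserve dot notation for reading columns and to use square brackets whenever a column is created or overwritten.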


In [27]:
del temp

In [28]:
data.iloc[1]

Survived                                                    1
Pclass                                                      1
Name        Cumings, Mrs. John Bradley (Florence Briggs Th...
Sex                                                    female
Age                                                     38.00
                                  ...                        
Parch                                                       0
Ticket                                               PC 17599
Fare                                                    71.28
Cabin                                                     C85
Embarked                                                    C
Name: 2, dtype: object

In [29]:
data.iloc[[1]]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.28,C85,C


In [30]:
data["NumFam"] = data.SibSp + data.Parch
data

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,NumFam
1,0,3,"Braund, Mr. Owen Harris",male,22.00,1,0,A/5 21171,7.25,,S,1
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.00,1,0,PC 17599,71.28,C85,C,1
3,1,3,"Heikkinen, Miss. Laina",female,26.00,0,0,STON/O2. 3101282,7.92,,S,0
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.00,1,0,113803,53.10,C123,S,1
5,0,3,"Allen, Mr. William Henry",male,35.00,0,0,373450,8.05,,S,0
...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.00,0,0,211536,13.00,,S,0
888,1,1,"Graham, Miss. Margaret Edith",female,19.00,0,0,112053,30.00,B42,S,0
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S,3
890,1,1,"Behr, Mr. Karl Howell",male,26.00,0,0,111369,30.00,C148,C,0


In [31]:
# another way to filter is with boolean masks
data[data.SibSp > 0]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,NumFam
1,0,3,"Braund, Mr. Owen Harris",male,22.00,1,0,A/5 21171,7.25,,S,1
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.00,1,0,PC 17599,71.28,C85,C,1
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.00,1,0,113803,53.10,C123,S,1
8,0,3,"Palsson, Master. Gosta Leonard",male,2.00,3,1,349909,21.07,,S,4
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.00,1,0,237736,30.07,,C,1
...,...,...,...,...,...,...,...,...,...,...,...,...
867,1,2,"Duran y More, Miss. Asuncion",female,27.00,1,0,SC/PARIS 2149,13.86,,C,1
870,1,3,"Johnson, Master. Harold Theodor",male,4.00,1,1,347742,11.13,,S,2
872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.00,1,1,11751,52.55,D35,S,2
875,1,2,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28.00,1,0,P/PP 3381,24.00,,C,1


In [32]:
data[(data.SibSp > 0) & (data.Age < 18)]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,NumFam
8,0,3,"Palsson, Master. Gosta Leonard",male,2.00,3,1,349909,21.07,,S,4
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.00,1,0,237736,30.07,,C,1
11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.00,1,1,PP 9549,16.70,G6,S,2
17,0,3,"Rice, Master. Eugene",male,2.00,4,1,382652,29.12,,Q,5
25,0,3,"Palsson, Miss. Torborg Danira",female,8.00,3,1,349909,21.07,,S,4
...,...,...,...,...,...,...,...,...,...,...,...,...
831,1,3,"Yasbeck, Mrs. Antoni (Selini Alexander)",female,15.00,1,0,2659,14.45,,C,1
832,1,2,"Richards, Master. George Sibley",male,0.83,1,1,29106,18.75,,S,2
851,0,3,"Andersson, Master. Sigvard Harald Elias",male,4.00,4,2,347082,31.27,,S,6
853,0,3,"Boulos, Miss. Nourelain",female,9.00,1,1,2678,15.25,,C,2


### Exercise

###### select men older than 65 who are traveling alone

In [None]:
# write your solution here...


In [34]:
# %load soluciones/mayores_solos.py


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,NumFam
34,0,2,"Wheadon, Mr. Edward H",male,66.0,0,0,C.A. 24579,10.5,,S,0
97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.65,A5,C,0
117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q,0
281,0,3,"Duane, Mr. Frank",male,65.0,0,0,336439,7.75,,Q,0
457,0,1,"Millet, Mr. Francis Davis",male,65.0,0,0,13509,26.55,E38,S,0
494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5,,C,0
631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S,0
673,0,2,"Mitchell, Mr. Henry Michael",male,70.0,0,0,C.A. 24580,10.5,,S,0
852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.78,,S,0
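The contents of `soluciones/mayores_solos.py` are not shown, so the following is only one possible approach, sketched on a hypothetical four-row sample that reuses the tutorial's column names. The output above includes passengers aged exactly 65, so the sketch assumes `>= 65`, and it assumes "alone" means `NumFam == 0`.

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as the tutorial's data
toy = pd.DataFrame(
    {
        "Sex": ["male", "female", "male", "male"],
        "Age": [66.0, 70.0, 80.0, 40.0],
        "NumFam": [0, 0, 2, 0],
    },
    index=pd.Index([34, 35, 36, 37], name="PassengerId"),
)

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than == and >=.
mask = (toy.Sex == "male") & (toy.Age >= 65) & (toy.NumFam == 0)
older_alone = toy[mask]

print(older_alone)
```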


### Filtering Rows and Columns

To remove what you don't want, instead of selecting what you do:

```
DataFrame.drop(labels, axis=0, ...)

Parameters
----------
labels : single label or list of labels
axis : integer or dimension name
    - 0 / 'index', to drop rows
    - 1 / 'columns', to drop columns
```
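As a quick illustration of both values of `axis`, here is a minimal sketch on a hypothetical 3×3 frame. `drop` returns a new object, so the original frame keeps its shape.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["r1", "r2", "r3"],
)

without_rows = df.drop(["r1", "r3"])   # axis=0 is the default: drops rows by label
without_cols = df.drop(["b"], axis=1)  # axis=1 (or axis="columns"): drops columns

print(without_rows.index.tolist())
print(without_cols.columns.tolist())
print(df.shape)  # the original is unchanged
```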

In [35]:
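# np.random.choice samples with replacement by default, so ids can repeat; pass replace=False to avoid that
valid_index = np.random.choice(data.index, int(data.index.shape[0] * 0.1))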
valid_index

array([296, 527, 831, 157, 349,  69, 606, 341, 359, 211, 840, 611, 432,
       822, 449, 210, 865, 328,   8, 445, 695, 788, 206, 513, 673, 391,
       525, 287, 673, 706, 545, 456, 455,  63, 720, 380, 135, 195, 832,
       663, 640, 375, 786, 868, 660, 149, 625, 124, 139, 326, 655, 262,
       639, 391, 295, 131, 221, 106, 162, 108, 847, 127, 657,  26, 630,
       778, 805, 550,  70, 126, 319, 449, 844,  39, 715, 884, 254, 180,
       192, 491, 710,  27, 408, 231, 736, 224,  21, 442, 154], dtype=int64)
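The array above contains repeated ids (673, 391 and 449 each appear twice), because `np.random.choice` draws with replacement unless told otherwise. A sketch of a split without repeats, on a hypothetical 100-row frame and with a seed added only for reproducibility, might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with ids 1..100 standing in for PassengerId
df = pd.DataFrame({"x": range(100)}, index=pd.Index(range(1, 101), name="id"))

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
n_valid = int(len(df) * 0.1)

# replace=False guarantees that every id is drawn at most once
valid_index = rng.choice(df.index.to_numpy(), size=n_valid, replace=False)

valid = df.loc[valid_index]
train = df.drop(valid_index)

print(len(train), len(valid))
```

`DataFrame.sample(frac=0.1)` is another option that also samples without replacement by default.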

In [36]:
train = data.drop(valid_index)
valid = data.loc[valid_index]
train

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,NumFam
1,0,3,"Braund, Mr. Owen Harris",male,22.00,1,0,A/5 21171,7.25,,S,1
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.00,1,0,PC 17599,71.28,C85,C,1
3,1,3,"Heikkinen, Miss. Laina",female,26.00,0,0,STON/O2. 3101282,7.92,,S,0
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.00,1,0,113803,53.10,C123,S,1
5,0,3,"Allen, Mr. William Henry",male,35.00,0,0,373450,8.05,,S,0
...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.00,0,0,211536,13.00,,S,0
888,1,1,"Graham, Miss. Margaret Edith",female,19.00,0,0,112053,30.00,B42,S,0
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S,3
890,1,1,"Behr, Mr. Karl Howell",male,26.00,0,0,111369,30.00,C148,C,0


In [37]:
valid

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,NumFam
296,0,1,"Lewy, Mr. Ervin G",male,,0,0,PC 17612,27.72,,C,0
527,1,2,"Ridsdale, Miss. Lucy",female,50.00,0,0,W./C. 14258,10.50,,S,0
831,1,3,"Yasbeck, Mrs. Antoni (Selini Alexander)",female,15.00,1,0,2659,14.45,,C,1
157,1,3,"Gilnagh, Miss. Katherine ""Katie""",female,16.00,0,0,35851,7.73,,Q,0
349,1,3,"Coutts, Master. William Loch ""William""",male,3.00,1,1,C.A. 37671,15.90,,S,2
...,...,...,...,...,...,...,...,...,...,...,...,...
736,0,3,"Williams, Mr. Leslie",male,28.50,0,0,54636,16.10,,S,0
224,0,3,"Nenkoff, Mr. Christo",male,,0,0,349234,7.90,,S,0
21,0,2,"Fynney, Mr. Joseph J",male,35.00,0,0,239865,26.00,,S,0
442,0,3,"Hampe, Mr. Leon",male,20.00,0,0,345769,9.50,,S,0


In [38]:
X_train, y_train = train.drop("Survived", axis=1), train["Survived"]
X_valid, y_valid = valid.drop("Survived", axis=1), valid["Survived"]
X_train

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,NumFam
1,3,"Braund, Mr. Owen Harris",male,22.00,1,0,A/5 21171,7.25,,S,1
2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.00,1,0,PC 17599,71.28,C85,C,1
3,3,"Heikkinen, Miss. Laina",female,26.00,0,0,STON/O2. 3101282,7.92,,S,0
4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.00,1,0,113803,53.10,C123,S,1
5,3,"Allen, Mr. William Henry",male,35.00,0,0,373450,8.05,,S,0
...,...,...,...,...,...,...,...,...,...,...,...
887,2,"Montvila, Rev. Juozas",male,27.00,0,0,211536,13.00,,S,0
888,1,"Graham, Miss. Margaret Edith",female,19.00,0,0,112053,30.00,B42,S,0
889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S,3
890,1,"Behr, Mr. Karl Howell",male,26.00,0,0,111369,30.00,C148,C,0


In [39]:
y_train

PassengerId
1      0
2      1
3      1
4      1
5      0
      ..
887    0
888    1
889    0
890    1
891    0
Name: Survived, dtype: int64

### Grouping and Contingency Tables

#### Grouping

Grouping lets you run calculations on subsets of the data. It generally has three parts:

1. Define the groups
2. Apply a calculation
3. Combine the results
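The three steps can be written out by hand to see what `groupby` does in one call. This is a sketch on hypothetical data that borrows the `Pclass` and `Survived` column names; it is not how pandas implements grouping internally.

```python
import pandas as pd

# Hypothetical data using the tutorial's column names
df = pd.DataFrame({"Pclass": [1, 1, 2, 3, 3], "Survived": [1, 0, 1, 0, 0]})

# 1. Define the groups: one sub-frame per Pclass value
pieces = {key: group for key, group in df.groupby("Pclass")}

# 2. Apply a calculation to each piece
means = {key: group["Survived"].mean() for key, group in pieces.items()}

# 3. Combine the results into a single Series
manual = pd.Series(means, name="Survived").rename_axis("Pclass")

# The same three steps in one expression
shortcut = df.groupby("Pclass")["Survived"].mean()

print(manual.tolist())
print(shortcut.tolist())
```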

In [40]:
# group
agrupado = data.groupby("Pclass")
agrupado

<pandas.core.groupby.DataFrameGroupBy object at 0x000000000B26E898>

In [41]:
# so far we have only grouped; nothing has been computed yet, for that we need to apply a function
agrupado.Survived.mean()

Pclass
1   0.63
2   0.47
3   0.24
Name: Survived, dtype: float64

In [42]:
# dict-based renaming in SeriesGroupBy.agg only works on older pandas (it was removed in 1.0)
agrupado.Survived.agg({"media": "mean", "media_2": np.mean, "varianza": "var", "cantidad": "count"})

Pclass,media_2,cantidad,varianza,media
1,0.63,216,0.23,0.63
2,0.47,184,0.25,0.47
3,0.24,491,0.18,0.24
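On pandas 0.25 and later, the same kind of renamed summary can be requested with named aggregation, where each keyword becomes an output column. The sketch below uses hypothetical data and illustrative column names (`average`, `variance`, `count`).

```python
import pandas as pd

# Hypothetical data using the tutorial's column names
df = pd.DataFrame({"Pclass": [1, 1, 2, 2, 3], "Survived": [1, 0, 1, 1, 0]})

# keyword = name of the output column, value = aggregation to apply
summary = df.groupby("Pclass")["Survived"].agg(
    average="mean", variance="var", count="count"
)

print(summary)
```

Unlike a dict on old pandas versions, the keyword order determines the column order of the result.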


In [43]:
data.columns

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked', 'NumFam'],
      dtype='object')

In [44]:
data.groupby("Survived")[['Age', 'SibSp', 'Parch', 'NumFam', 'Fare']].mean()

Survived,Age,SibSp,Parch,NumFam,Fare
0,30.63,0.55,0.33,0.88,22.12
1,28.34,0.47,0.46,0.94,48.4


#### Contingency Tables

Contingency tables resemble Excel pivot tables; they are useful for examining interactions between variables.

In [45]:
pd.crosstab(data.Pclass, data.Survived)

Survived,0,1
Pclass,,
1,80,136
2,97,87
3,372,119


In [46]:
pd.crosstab(data.Pclass, data.Survived).apply(lambda x: x/x.sum(), axis=1)

Survived,0,1
Pclass,,
1,0.37,0.63
2,0.53,0.47
3,0.76,0.24
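Row-wise proportions like the ones above can also be requested directly through the `normalize` parameter of `pd.crosstab`. The sketch below compares both routes on hypothetical data; `normalize="index"` divides each row by its total.

```python
import pandas as pd

# Hypothetical data using the tutorial's column names
df = pd.DataFrame({"Pclass": [1, 1, 2, 3, 3, 3], "Survived": [1, 0, 1, 0, 0, 1]})

# Route 1: count first, then divide each row by its sum
by_apply = pd.crosstab(df.Pclass, df.Survived).apply(lambda x: x / x.sum(), axis=1)

# Route 2: let crosstab normalize over each row
by_param = pd.crosstab(df.Pclass, df.Survived, normalize="index")

print(by_param)
```

`normalize="columns"` and `normalize="all"` give proportions per column and over the whole table, respectively.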


In [47]:
data.Age.value_counts()

24.00    30
22.00    27
18.00    26
19.00    25
30.00    25
         ..
55.50     1
70.50     1
66.00     1
23.50     1
0.42      1
Name: Age, dtype: int64

In [48]:
data.Age.value_counts(normalize=True).sort_index()

0.42    0.00
0.67    0.00
0.75    0.00
0.83    0.00
0.92    0.00
        ... 
70.00   0.00
70.50   0.00
71.00   0.00
74.00   0.00
80.00   0.00
Name: Age, dtype: float64

In [49]:
pd.crosstab(data.Pclass, pd.cut(data.Age, [i * 10 for i in range(9)]), values=data.Survived, aggfunc=np.mean)

Age,"(0, 10]","(10, 20]","(20, 30]","(30, 40]","(40, 50]","(50, 60]","(60, 70]","(70, 80]"
Pclass,,,,,,,,
1,0.67,0.83,0.72,0.76,0.57,0.6,0.18,0.33
2,1.0,0.5,0.41,0.44,0.53,0.17,0.33,
3,0.43,0.25,0.23,0.21,0.07,0.0,0.33,0.0


In [50]:
pd.crosstab(data.Pclass, pd.cut(data.Age, [i * 10 for i in range(9)]))

Age,"(0, 10]","(10, 20]","(20, 30]","(30, 40]","(40, 50]","(50, 60]","(60, 70]","(70, 80]"
Pclass,,,,,,,,
1,3,18,40,49,37,25,11,3
2,17,18,61,43,19,12,3,0
3,44,79,129,63,30,5,3,2
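The age brackets in the two tables above come from `pd.cut`, which assigns each value to an interval. By default the intervals are open on the left and closed on the right, so an age of exactly 10 falls in `(0, 10]`. A small sketch with hypothetical ages:

```python
import pandas as pd

ages = pd.Series([0.42, 10.0, 10.5, 35.0, 80.0])  # hypothetical ages
bins = [i * 10 for i in range(9)]                 # 0, 10, 20, ..., 80

# Each age is mapped to the interval (left, right] that contains it
groups = pd.cut(ages, bins)

for age, interval in zip(ages, groups):
    print(age, interval)
```

Values outside the outermost edges (for example an age of 0 or 85 with these bins) would come back as missing.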


### Putting It All Together in a Data Mining Example

In [51]:
# some variables are not numeric and need to be encoded first
tipos = data.dtypes
tipos

Survived      int64
Pclass        int64
Name         object
Sex          object
Age         float64
             ...   
Ticket       object
Fare        float64
Cabin        object
Embarked     object
NumFam        int64
dtype: object

In [52]:
tipos_objeto = tipos[tipos == "object"]
tipos_objeto

Name        object
Sex         object
Ticket      object
Cabin       object
Embarked    object
dtype: object

In [53]:
nulos = data.isnull().sum()
nulos

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
           ... 
Ticket        0
Fare          0
Cabin       687
Embarked      2
NumFam        0
dtype: int64

In [54]:
nulos[nulos > 0]

Age         177
Cabin       687
Embarked      2
dtype: int64

In [55]:
data["Sex"].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [56]:
data["Sex"] = data.Sex.apply(lambda x: {"male": 0, "female": 1}[x])
data["Sex"].value_counts()

0    577
1    314
Name: Sex, dtype: int64
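An alternative to `apply` with a dictionary lookup is `Series.map`, which accepts the dictionary directly. This is a sketch on a hypothetical four-value Series, not a change to the cell above.

```python
import pandas as pd

sex = pd.Series(["male", "female", "female", "male"])  # hypothetical values

# map looks each value up in the dictionary
codes = sex.map({"male": 0, "female": 1})

print(codes.tolist())
```

One difference worth knowing: a value missing from the dictionary becomes `NaN` with `map`, whereas the lambda-based lookup raises a `KeyError`.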

In [57]:
data["Ticket"].unique().shape

(681,)

In [58]:
data["Ticket"].factorize()

(array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
         13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,   7,  24,
         25,  26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,
         38,  39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,
         51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,
         64,  65,  66,  67,  68,  69,  58,  70,  71,  72,  73,  74,  75,
         76,  77,  78,  79,  80,  81,  82,  83,  84,  85,  26,  86,  87,
         88,  89,  90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100,
        101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113,
         40, 114,  13,  70, 115,   9, 116,  99,  38, 117, 118, 119, 120,
        121, 122, 123, 124, 125, 126, 127,   3, 128, 129, 130, 131, 132,
        133, 134, 135, 136,  84, 137, 138, 139, 140, 141, 142, 143, 144,
        145, 146, 147, 148, 149, 150, 151, 152,  49, 153, 154,  62, 155,
         72, 156,  16,   8, 157, 158, 159, 160, 161

In [59]:
data["Ticket"] = data["Ticket"].factorize()[0]
data["Ticket"].value_counts()

72     7
13     7
148    7
62     6
49     6
      ..
445    1
444    1
443    1
442    1
0      1
Name: Ticket, dtype: int64
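`factorize` returns two objects: an array of integer codes and the unique values those codes refer to, numbered in order of first appearance. The sketch below shows both on a hypothetical Series of ticket strings.

```python
import pandas as pd

# Hypothetical ticket values; the first one appears twice
tickets = pd.Series(["A/5 21171", "PC 17599", "A/5 21171", "113803"])

codes, uniques = tickets.factorize()

print(codes.tolist())   # equal tickets share the same code
print(list(uniques))    # position in `uniques` is the code

# The original values can be recovered by indexing uniques with the codes
restored = list(uniques[codes])
```

These codes are arbitrary identifiers, so a model that treats them as ordered numbers may read meaning into them that is not there.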

In [60]:
data.Embarked.fillna(-1).value_counts()

S     644
C     168
Q      77
-1      2
Name: Embarked, dtype: int64

In [61]:
data[data.Embarked.isnull()]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,NumFam
62,1,1,"Icard, Miss. Amelie",1,38.0,0,0,60,80.0,B28,,0
830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",1,62.0,0,0,60,80.0,B28,,0


In [62]:
data[(data.Fare >= 70) & (data.Fare <= 90)].Embarked.value_counts()

S    25
C    19
Q     2
Name: Embarked, dtype: int64

In [63]:
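# assign the result back; fillna(..., inplace=True) on a selected column may not update `data` in recent pandas
data["Embarked"] = data.Embarked.fillna("S")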
data.Embarked.fillna(-1).value_counts()

S    646
C    168
Q     77
Name: Embarked, dtype: int64

In [64]:
pd.crosstab(data.Embarked, data.Survived)

Survived,0,1
Embarked,,
C,75,93
Q,47,30
S,427,219


In [65]:
pd.crosstab(data.Embarked, data.Survived).apply(lambda x: x/x.sum(), axis=1)

Survived,0,1
Embarked,,
C,0.45,0.55
Q,0.61,0.39
S,0.66,0.34


In [66]:
pd.get_dummies(data.Embarked)

PassengerId,C,Q,S
1,0,0,1
2,1,0,0
3,0,0,1
4,0,0,1
5,0,0,1
...,...,...,...
887,0,0,1
888,0,0,1
889,0,0,1
890,1,0,0


In [67]:
data = data.join(pd.get_dummies(data.Embarked)).drop("Embarked", axis=1)

In [68]:
data

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,NumFam,C,Q,S
1,0,3,"Braund, Mr. Owen Harris",0,22.00,1,0,0,7.25,,1,0,0,1
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.00,1,0,1,71.28,C85,1,1,0,0
3,1,3,"Heikkinen, Miss. Laina",1,26.00,0,0,2,7.92,,0,0,0,1
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.00,1,0,3,53.10,C123,1,0,0,1
5,0,3,"Allen, Mr. William Henry",0,35.00,0,0,4,8.05,,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",0,27.00,0,0,677,13.00,,0,0,0,1
888,1,1,"Graham, Miss. Margaret Edith",1,19.00,0,0,678,30.00,B42,0,0,0,1
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",1,,1,2,614,23.45,,3,0,0,1
890,1,1,"Behr, Mr. Karl Howell",0,26.00,0,0,679,30.00,C148,0,1,0,0


In [69]:
data.Cabin.fillna(-1).value_counts()

-1             687
B96 B98          4
G6               4
C23 C25 C27      4
F2               3
              ... 
C49              1
C101             1
D28              1
E10              1
C103             1
Name: Cabin, dtype: int64

In [70]:
data["Cabin"] = data.Cabin.fillna(-1).factorize()[0]
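
To make the encoding step above easier to follow, here is a small sketch of what `factorize` returns. It runs on a toy Series and, as an assumption for illustration, uses the string sentinel "missing" instead of -1 so that the column keeps a single type.

```python
import pandas as pd

# Toy stand-in for data.Cabin; None marks passengers without a cabin.
cabin = pd.Series(["C85", None, "C123", None, "C85"], name="Cabin")

# factorize returns (codes, uniques): codes are integers assigned in order
# of first appearance, uniques holds the original labels.
codes, uniques = cabin.fillna("missing").factorize()

print(codes)    # integer code per row
print(uniques)  # label for each code
```

Indexing `[0]` on the result, as the notebook does, keeps only the integer codes. These codes are arbitrary identifiers, so a model that treats them as ordered numbers may read meaning into them that is not there.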

In [71]:
data

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,NumFam,C,Q,S
1,0,3,"Braund, Mr. Owen Harris",0,22.00,1,0,0,7.25,0,1,0,0,1
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.00,1,0,1,71.28,1,1,1,0,0
3,1,3,"Heikkinen, Miss. Laina",1,26.00,0,0,2,7.92,0,0,0,0,1
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.00,1,0,3,53.10,2,1,0,0,1
5,0,3,"Allen, Mr. William Henry",0,35.00,0,0,4,8.05,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",0,27.00,0,0,677,13.00,0,0,0,0,1
888,1,1,"Graham, Miss. Margaret Edith",1,19.00,0,0,678,30.00,146,0,0,0,1
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",1,,1,2,614,23.45,0,3,0,0,1
890,1,1,"Behr, Mr. Karl Howell",0,26.00,0,0,679,30.00,147,0,1,0,0


In [72]:
data.Age.fillna(-1).value_counts()

-1.00    177
24.00     30
22.00     27
18.00     26
28.00     25
        ... 
36.50      1
55.50      1
66.00      1
23.50      1
0.42       1
Name: Age, dtype: int64

In [73]:
pd.crosstab(data.Pclass, data.Age.isnull()).apply(lambda x: x/x.sum(), axis=1)

Pclass,Age.isnull()=False,Age.isnull()=True
1,0.86,0.14
2,0.94,0.06
3,0.72,0.28


In [74]:
data["Age_nul"] = data.Age.isnull().astype(int)
data

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,NumFam,C,Q,S,Age_nul
1,0,3,"Braund, Mr. Owen Harris",0,22.00,1,0,0,7.25,0,1,0,0,1,0
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.00,1,0,1,71.28,1,1,1,0,0,0
3,1,3,"Heikkinen, Miss. Laina",1,26.00,0,0,2,7.92,0,0,0,0,1,0
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.00,1,0,3,53.10,2,1,0,0,1,0
5,0,3,"Allen, Mr. William Henry",0,35.00,0,0,4,8.05,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",0,27.00,0,0,677,13.00,0,0,0,0,1,0
888,1,1,"Graham, Miss. Margaret Edith",1,19.00,0,0,678,30.00,146,0,0,0,1,0
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",1,,1,2,614,23.45,0,3,0,0,1,1
890,1,1,"Behr, Mr. Karl Howell",0,26.00,0,0,679,30.00,147,0,1,0,0,0


In [75]:
data["Age"] = data.Age.fillna(data.Age.mean())
data

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,NumFam,C,Q,S,Age_nul
1,0,3,"Braund, Mr. Owen Harris",0,22.00,1,0,0,7.25,0,1,0,0,1,0
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.00,1,0,1,71.28,1,1,1,0,0,0
3,1,3,"Heikkinen, Miss. Laina",1,26.00,0,0,2,7.92,0,0,0,0,1,0
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.00,1,0,3,53.10,2,1,0,0,1,0
5,0,3,"Allen, Mr. William Henry",0,35.00,0,0,4,8.05,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",0,27.00,0,0,677,13.00,0,0,0,0,1,0
888,1,1,"Graham, Miss. Margaret Edith",1,19.00,0,0,678,30.00,146,0,0,0,1,0
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",1,29.70,1,2,614,23.45,0,3,0,0,1,1
890,1,1,"Behr, Mr. Karl Howell",0,26.00,0,0,679,30.00,147,0,1,0,0,0
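
The two cells above work as a pair: the missing-value flag has to be created before the fill, because afterwards the gaps can no longer be told apart from real ages. A compact sketch on made-up ages (the frame `toy` is an assumption):

```python
import numpy as np
import pandas as pd

# Made-up ages with two gaps.
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 36.0]})

# 1) Record which rows were missing while that is still visible.
toy["Age_nul"] = toy["Age"].isnull().astype(int)

# 2) Fill the gaps with the mean of the observed values (NaN is skipped).
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

print(toy)
```

Keeping the indicator column lets a model learn from the fact that a value was missing, which may matter here since the share of missing ages differs by `Pclass`.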


In [76]:
data.isnull().sum().sum()

0

In [77]:
data.drop("Name", axis=1, inplace=True)
data

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,NumFam,C,Q,S,Age_nul
1,0,3,0,22.00,1,0,0,7.25,0,1,0,0,1,0
2,1,1,1,38.00,1,0,1,71.28,1,1,1,0,0,0
3,1,3,1,26.00,0,0,2,7.92,0,0,0,0,1,0
4,1,1,1,35.00,1,0,3,53.10,2,1,0,0,1,0
5,0,3,0,35.00,0,0,4,8.05,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,0,27.00,0,0,677,13.00,0,0,0,0,1,0
888,1,1,1,19.00,0,0,678,30.00,146,0,0,0,1,0
889,0,3,1,29.70,1,2,614,23.45,0,3,0,0,1,1
890,1,1,0,26.00,0,0,679,30.00,147,0,1,0,0,0


In [78]:
data.dtypes.value_counts()

int64      8
uint8      3
float64    2
int32      1
dtype: int64

In [79]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 14 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null int64
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null int64
Fare        891 non-null float64
Cabin       891 non-null int64
NumFam      891 non-null int64
C           891 non-null uint8
Q           891 non-null uint8
S           891 non-null uint8
Age_nul     891 non-null int32
dtypes: float64(2), int32(1), int64(8), uint8(3)
memory usage: 82.7 KB


In [80]:
valid_index

array([296, 527, 831, 157, 349,  69, 606, 341, 359, 211, 840, 611, 432,
       822, 449, 210, 865, 328,   8, 445, 695, 788, 206, 513, 673, 391,
       525, 287, 673, 706, 545, 456, 455,  63, 720, 380, 135, 195, 832,
       663, 640, 375, 786, 868, 660, 149, 625, 124, 139, 326, 655, 262,
       639, 391, 295, 131, 221, 106, 162, 108, 847, 127, 657,  26, 630,
       778, 805, 550,  70, 126, 319, 449, 844,  39, 715, 884, 254, 180,
       192, 491, 710,  27, 408, 231, 736, 224,  21, 442, 154], dtype=int64)

In [81]:
X_train, y_train = data.drop(valid_index).drop("Survived", axis=1), data.drop(valid_index).Survived
X_valid, y_valid = data.loc[valid_index].drop("Survived", axis=1), data.loc[valid_index, "Survived"]
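
The printed `valid_index` contains some labels more than once (673, 391 and 449 each appear twice), so `data.loc[valid_index]` returns those rows twice. The sketch below shows one way to draw validation labels without repetition; the toy frame, the seed and the name `valid_idx` are assumptions for illustration and do not reproduce how `valid_index` was built.

```python
import numpy as np
import pandas as pd

# Toy frame indexed by PassengerId, like data in the notebook.
toy = pd.DataFrame(
    {"Survived": [0, 1, 1, 0, 1, 0, 0, 1, 0, 1], "Fare": np.arange(10, dtype=float)},
    index=pd.RangeIndex(1, 11, name="PassengerId"),
)

# Sample index labels without replacement so no passenger is repeated.
rng = np.random.default_rng(0)
valid_idx = rng.choice(toy.index.to_numpy(), size=3, replace=False)

# Rows not selected form the training set; selected rows form validation.
train = toy.drop(valid_idx)
valid = toy.loc[valid_idx]
X_train, y_train = train.drop("Survived", axis=1), train["Survived"]
X_valid, y_valid = valid.drop("Survived", axis=1), valid["Survived"]
```

With unique labels the training and validation sets are disjoint and every validation passenger counts once in the metrics.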

In [82]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid

from sklearn.metrics import roc_auc_score, accuracy_score



In [84]:
learners = [
    {
        "learner": LogisticRegression,
        "params": {
            "C": [0.0001, 0.01, 1, 10, 100]
        }
    },
    {
        "learner": RandomForestClassifier,
        "params": {
            "max_depth": [4, 10, 20, None],
            "n_estimators": [10, 100]
        }
    }
]
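
Each "params" dictionary above describes a grid: every listed value of one hyperparameter is combined with every value of the others. The sketch below imitates that expansion using only the standard library; `expand_grid` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
from itertools import product


def expand_grid(grid):
    """Return one dict per combination of the values listed in grid."""
    keys = sorted(grid)  # fixed key order so the output is reproducible
    value_lists = [grid[k] for k in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


combos = expand_grid({"max_depth": [4, 10, 20, None], "n_estimators": [10, 100]})
for combo in combos:
    print(combo)
```

For the random forest entry this gives 4 x 2 = 8 candidate configurations, and the logistic regression entry gives 5, so the search fits 13 models in total.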

In [90]:
pd.options.display.max_rows = 20
resultados = []
for candidate in learners:
    # try every hyperparameter combination defined for this learner
    for params in ParameterGrid(candidate["params"]):
        learner = candidate["learner"](**params)
        learner.fit(X_train, y_train)
        # probability of the positive class (last column of predict_proba)
        probs = learner.predict_proba(X_valid)[:, -1]
        # show feature importances (forests) or coefficients (linear models)
        if hasattr(learner, "feature_importances_"):
            print("importance", params)
            print(pd.Series(learner.feature_importances_, index=X_train.columns))
            print("*"*10)
        elif hasattr(learner, "coef_"):
            print("coefs", params)
            print(pd.Series(learner.coef_[0], index=X_train.columns))
            print("*"*10)
        # validation metrics: ROC AUC on probabilities, accuracy at a 0.5 cutoff
        res = {
            "learner": learner.__class__.__name__,
            "parmas": params,
            "area": roc_auc_score(y_valid, probs),
            "accuracy": accuracy_score(y_valid, probs >= 0.5)
        }
        resultados.append(res)

coefs {'C': 0.0001}
Pclass    -0.01
Sex        0.01
Age       -0.02
SibSp     -0.01
Parch     -0.00
Ticket    -0.00
Fare       0.01
Cabin      0.01
NumFam    -0.01
C          0.00
Q          0.00
S         -0.00
Age_nul   -0.00
dtype: float64
**********
coefs {'C': 0.01}
Pclass    -0.18
Sex        0.67
Age       -0.02
SibSp     -0.12
Parch      0.05
Ticket    -0.00
Fare       0.01
Cabin      0.01
NumFam    -0.06
C          0.06
Q          0.05
S         -0.08
Age_nul   -0.05
dtype: float64
**********
coefs {'C': 1}
Pclass    -0.81
Sex        2.78
Age       -0.03
SibSp     -0.21
Parch      0.04
Ticket    -0.00
Fare       0.00
Cabin      0.01
NumFam    -0.17
C          0.42
Q          0.61
S          0.07
Age_nul   -0.36
dtype: float64
**********
coefs {'C': 10}
Pclass    -0.92
Sex        2.90
Age       -0.04
SibSp     -0.21
Parch      0.03
Ticket    -0.00
Fare       0.00
Cabin      0.01
NumFam    -0.19
C          0.49
Q          0.78
S          0.13
Age_nul   -0.41
dtype: float64
******

In [87]:
resultados = pd.DataFrame.from_dict(resultados)[["learner", "parmas", "accuracy", "area"]]
resultados

Unnamed: 0,learner,parmas,accuracy,area
0,LogisticRegression,{'C': 0.0001},0.62,0.69
1,LogisticRegression,{'C': 0.01},0.64,0.76
2,LogisticRegression,{'C': 1},0.71,0.75
3,LogisticRegression,{'C': 10},0.71,0.75
4,LogisticRegression,{'C': 100},0.71,0.75
5,RandomForestClassifier,"{'max_depth': 4, 'n_estimators': 10}",0.74,0.74
6,RandomForestClassifier,"{'max_depth': 4, 'n_estimators': 100}",0.73,0.76
7,RandomForestClassifier,"{'max_depth': 10, 'n_estimators': 10}",0.75,0.78
8,RandomForestClassifier,"{'max_depth': 10, 'n_estimators': 100}",0.79,0.8
9,RandomForestClassifier,"{'max_depth': 20, 'n_estimators': 10}",0.76,0.82


In [88]:
resultados.sort_values("area", ascending=False)

Unnamed: 0,learner,parmas,accuracy,area
9,RandomForestClassifier,"{'max_depth': 20, 'n_estimators': 10}",0.76,0.82
12,RandomForestClassifier,"{'max_depth': None, 'n_estimators': 100}",0.79,0.82
10,RandomForestClassifier,"{'max_depth': 20, 'n_estimators': 100}",0.78,0.81
8,RandomForestClassifier,"{'max_depth': 10, 'n_estimators': 100}",0.79,0.8
11,RandomForestClassifier,"{'max_depth': None, 'n_estimators': 10}",0.76,0.8
7,RandomForestClassifier,"{'max_depth': 10, 'n_estimators': 10}",0.75,0.78
6,RandomForestClassifier,"{'max_depth': 4, 'n_estimators': 100}",0.73,0.76
1,LogisticRegression,{'C': 0.01},0.64,0.76
4,LogisticRegression,{'C': 100},0.71,0.75
2,LogisticRegression,{'C': 1},0.71,0.75
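
Once the results are in a DataFrame, the best configuration can be picked programmatically instead of read off the sorted table. A small sketch with made-up scores (the frame `scores` and its numbers are illustrative assumptions, not results from the notebook):

```python
import pandas as pd

# Made-up scores for three hypothetical configurations.
scores = pd.DataFrame({
    "learner": ["A", "B", "C"],
    "accuracy": [0.71, 0.79, 0.76],
    "area": [0.75, 0.80, 0.82],
})

# Rank by AUC, using accuracy to break ties.
ranked = scores.sort_values(["area", "accuracy"], ascending=False)

# idxmax gives the index label of the row with the highest AUC.
best = scores.loc[scores["area"].idxmax()]
print(best)
```

Sorting on a second column makes the order deterministic when two configurations share the same rounded AUC.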
