# Python Notes for Artificial Intelligence and Machine Learning

We should design rewards that align precisely with what we want to _achieve_.

## 1. Python


### 1.1 Libraries

To import a library, the syntax is `import library`, optionally followed by `as alias` to shorten the name. You can also import only specific modules or functions from a library with `from library import function`.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from seaborn import set_style

#### While

In [None]:
"""
The structure of a while loop is:

while condition:
    something
"""
# https://es.wikipedia.org/wiki/Serie_de_Leibniz
i = 0
pi_approximation = 0
pi_approximations = []
while np.abs(4*pi_approximation-np.pi)/np.pi >= 0.000001:  # relative-error threshold
    pi_approximation += ((-1)**i) / ((2*i)+1)
    pi_approximations.append(4*pi_approximation)
    # residual = np.pi - 4*pi_approximation
    i+=1
print(f'Pi Approximation: {round(4*pi_approximation, 5)}')
print(f'Iterations: {len(pi_approximations)}')

Pi Approximation: 3.14159
Iterations: 318310
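
A `while` loop driven by a tolerance never ends if the condition is never met, so it is common to add an iteration cap. The following is a minimal sketch of that idea, not part of the original notebook; `tolerance` and `max_iterations` are illustrative names and values, and the threshold is looser than the one above so the loop finishes quickly.

```python
import math

tolerance = 1e-3          # assumed relative-error threshold
max_iterations = 10_000   # assumed safety cap on the number of terms

i = 0
partial_sum = 0.0
# Stop when the estimate is accurate enough OR the cap is reached
while abs(4 * partial_sum - math.pi) / math.pi >= tolerance and i < max_iterations:
    partial_sum += (-1) ** i / (2 * i + 1)  # next term of the Leibniz series
    i += 1

approximation = 4 * partial_sum
converged = abs(approximation - math.pi) / math.pi < tolerance
print(approximation, i, converged)
```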


#### Lists

Lists are ordered collections of objects in Python and can hold objects of any type. The name `list` is a built-in (not a reserved keyword, so avoid reusing it as a variable name): it is the function that creates a list from an iterable.
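
As a small illustration of building lists from iterables with `list`, the sketch below uses made-up variable names (`letters`, `numbers`, `pairs`) that do not appear elsewhere in these notes.

```python
letters = list("abc")                # from a string: one element per character
numbers = list(range(3))             # from a range object
pairs = list(zip(letters, numbers))  # from a zip object: a list of tuples

print(letters)
print(numbers)
print(pairs)
```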




In [None]:
my_list = ['c',2, True, print, [5,6]] 
my_list[0]

'c'

In [None]:
my_list[2:-1] # takes the elements from index 2 up to (:) the last one (-1), exclusive

[True, <function print>]
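
The slice syntax is `start:stop:step`, where `stop` is exclusive and negative indices count from the end. A few more cases, sketched on an illustrative list named `values`:

```python
values = ['a', 'b', 'c', 'd', 'e']

first_two = values[:2]       # from the start up to index 2 (exclusive)
middle = values[1:-1]        # drops the first and the last element
every_other = values[::2]    # step of 2
reversed_copy = values[::-1] # negative step walks the list backwards

print(first_two, middle, every_other, reversed_copy)
```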

In [None]:
# Lists can be nested to build matrices: the first index [i] selects element i of the outer list
# and the second index [k] selects element k of that element
my_list[4][1]

6

### 1.2 Pandas

In [13]:
import pandas as pd


#### DataFrame
A *DataFrame* is a pandas object for manipulating tabular data: a table of individual entries, where each entry is the cell at a given row and column.

In [14]:
pd.DataFrame({'col1': ['1', '3'], 
              'col2': ['2', '4']},
             index=['row1', 'row2'])

Unnamed: 0,col1,col2
row1,1,2
row2,3,4


A pandas Series, on the other hand, is a sequence of values. If a DataFrame is a table, a Series is a _*list*_: it can be initialized from nothing more than a Python list.

In [15]:
pd.Series([213000, 223411, 236012], index=['envios_2015', 'envios_2016', 'envios_2017'], name='producto_x')

envios_2015    213000
envios_2016    223411
envios_2017    236012
Name: producto_x, dtype: int64

In [16]:
pd.Series([1,2,3,4,5,6])

0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
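
A Series with a custom index can be read by label or by position, and it converts to a one-column DataFrame. The sketch below reuses the shipment values from the earlier cell; the variable names (`shipments`, `by_label`, `by_position`, `as_frame`) are illustrative.

```python
import pandas as pd

shipments = pd.Series(
    [213000, 223411, 236012],
    index=['envios_2015', 'envios_2016', 'envios_2017'],
    name='producto_x',
)

by_label = shipments['envios_2016']  # access by index label
by_position = shipments.iloc[1]      # access by integer position
as_frame = shipments.to_frame()      # one-column DataFrame named after the Series

print(by_label, by_position)
print(as_frame)
```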

#### Reading files

The `read_csv` function accepts many parameters for handling file particulars such as separators, headers, null values, etc.
 
  - https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
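
To make those parameters concrete, here is a minimal sketch that parses a small in-memory sample instead of a real file. The sample text and column names are made up for illustration; `header`, `names`, `skipinitialspace` and `na_values` are standard `read_csv` parameters.

```python
import io
import pandas as pd

# Made-up sample: no header row, fields separated by ", ", "?" marks a missing value
raw = "39, State-gov, Bachelors\n50, ?, Bachelors\n38, Private, HS-grad\n"

sample = pd.read_csv(
    io.StringIO(raw),
    header=None,                              # the file has no header line
    names=["age", "workclass", "education"],  # column names to assign
    skipinitialspace=True,                    # drop the space after each comma
    na_values="?",                            # treat "?" as a null value
)
print(sample)
```

With `skipinitialspace=True` the string columns contain `'Bachelors'` rather than `' Bachelors'`, which matters for exact comparisons later on.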

In [17]:
# https://archive.ics.uci.edu/ml/datasets/Adult
cols = ['age','workclass','fnlwgt','education','education_level','marital','occupation','relationship','race','sex','capital_gain','capital_loss','hours','native','target']
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None)
df.columns = cols
df.head(4)

Unnamed: 0,age,workclass,fnlwgt,education,education_level,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hours,native,target
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K


In [18]:
df.tail()

Unnamed: 0,age,workclass,fnlwgt,education,education_level,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hours,native,target
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K
32560,52,Self-emp-inc,287927,HS-grad,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,15024,0,40,United-States,>50K


In [19]:
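# Note: `df.info` without parentheses returns the bound method (the repr shown below);
# call `df.info()` to print the summary of columns, dtypes, and non-null counts.
df.info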

<bound method DataFrame.info of        age          workclass  fnlwgt    education  education_level  \
0       39          State-gov   77516    Bachelors               13   
1       50   Self-emp-not-inc   83311    Bachelors               13   
2       38            Private  215646      HS-grad                9   
3       53            Private  234721         11th                7   
4       28            Private  338409    Bachelors               13   
...    ...                ...     ...          ...              ...   
32556   27            Private  257302   Assoc-acdm               12   
32557   40            Private  154374      HS-grad                9   
32558   58            Private  151910      HS-grad                9   
32559   22            Private  201490      HS-grad                9   
32560   52       Self-emp-inc  287927      HS-grad                9   

                   marital          occupation    relationship    race  \
0            Never-married        Adm-cle
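
Since `info` is a method, it has to be called to produce its summary. The sketch below shows the call on a small made-up DataFrame (`toy`), capturing the text through the `buf` parameter so it can be inspected, alongside the `shape` and `dtypes` attributes.

```python
import io
import pandas as pd

toy = pd.DataFrame({"age": [39, 50, 38], "sex": ["Male", "Male", "Female"]})

buffer = io.StringIO()
toy.info(buf=buffer)         # info() writes its summary to the given buffer
summary = buffer.getvalue()

print(summary)
print(toy.shape)   # (rows, columns)
print(toy.dtypes)  # dtype of each column
```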

In Python we access an attribute of an object with the syntax `object.attribute`; here we can do the same to access one of the columns.

In [20]:
# Access the `sex` column as an attribute of the DataFrame
df.sex

0           Male
1           Male
2           Male
3           Male
4         Female
          ...   
32556     Female
32557       Male
32558     Female
32559       Male
32560     Female
Name: sex, Length: 32561, dtype: object

In [21]:
df['education_level'].tail()

32556    12
32557     9
32558     9
32559     9
32560     9
Name: education_level, dtype: int64

In [22]:
type(df.education_level)

pandas.core.series.Series

Bracket notation is another way to access columns, and it also supports names that contain special characters.

In [31]:
df.education_level[1]

13

In [25]:
df['education_level'][1]

13
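
Attribute access only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute. A minimal sketch with made-up column names:

```python
import pandas as pd

toy = pd.DataFrame({"hours per week": [40, 13], "count": [1, 2]})

# A name with spaces can only be reached with brackets
hours = toy["hours per week"]

# "count" collides with the DataFrame.count method:
# brackets return the column, attribute access returns the method
count_column = toy["count"]
count_attribute = toy.count

print(hours.tolist(), count_column.tolist(), callable(count_attribute))
```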

pandas provides its own accessors for selecting elements of a DataFrame:

+ Position-based selection: `iloc`
+ Label-based selection: `loc`

Both take the row selector first and the column selector second, that is, `loc[row, column]` or `iloc[row, column]`; `loc` expects labels and `iloc` expects integer positions.
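
The difference between the two is easiest to see when the index is not the default 0, 1, 2, ... The sketch below uses a made-up DataFrame (`toy`) whose index labels are 10, 20, 30.

```python
import pandas as pd

toy = pd.DataFrame(
    {"age": [39, 50, 38], "workclass": ["State-gov", "Self-emp-not-inc", "Private"]},
    index=[10, 20, 30],
)

by_label = toy.loc[10, "age"]   # row labelled 10, column labelled "age"
by_position = toy.iloc[0, 0]    # first row, first column

label_slice = toy.loc[10:20]    # label slices include both endpoints
position_slice = toy.iloc[0:1]  # position slices exclude the stop, as in Python lists

print(by_label, by_position, len(label_slice), len(position_slice))
```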

##### `loc`: this form of selection uses the **actual index value (label)**, not its position.

In [45]:
# select by label the value at row 0 of the `education_level` column
df.loc[0, 'education_level']

13

In [46]:
# select all rows and the columns given by the list
df.loc[:,['age', 'workclass', 'education']]

Unnamed: 0,age,workclass,education
0,39,State-gov,Bachelors
1,50,Self-emp-not-inc,Bachelors
2,38,Private,HS-grad
3,53,Private,11th
4,28,Private,Bachelors
...,...,...,...
32556,27,Private,Assoc-acdm
32557,40,Private,HS-grad
32558,58,Private,HS-grad
32559,22,Private,HS-grad


##### `iloc` also supports slice notation for selecting data.


In [51]:
# select all rows and only the first two columns (the result is a DataFrame)
df.iloc[:, [0,1]] # == df.loc[:, ['age','workclass']]

Unnamed: 0,age,workclass
0,39,State-gov
1,50,Self-emp-not-inc
2,38,Private
3,53,Private
4,28,Private
...,...,...
32556,27,Private
32557,40,Private
32558,58,Private
32559,22,Private


In [54]:
# select the first 4 rows of the first column
df.iloc[:4, 0]

0    39
1    50
2    38
3    53
Name: age, dtype: int64

In [56]:
# select the last 5 rows of the second column
df.iloc[-5:, 1]

32556          Private
32557          Private
32558          Private
32559          Private
32560     Self-emp-inc
Name: workclass, dtype: object

In [58]:
# select the rows at the positions given by a list, from column 0
df.iloc[[2,4,6], 0]

2    38
4    28
6    49
Name: age, dtype: int64

#### Modifying the index

In [60]:
# choosing a suitable column as the index matters; this one is a poor choice because its values repeat
df.set_index('education').head()

education,age,workclass,fnlwgt,education_level,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hours,native,target
Bachelors,39,State-gov,77516,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
Bachelors,50,Self-emp-not-inc,83311,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
HS-grad,38,Private,215646,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
11th,53,Private,234721,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
Bachelors,28,Private,338409,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
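
A short sketch of what a non-unique index implies, on a made-up DataFrame (`toy`): selecting one label with `loc` returns every matching row, and `reset_index` turns the index back into a regular column.

```python
import pandas as pd

toy = pd.DataFrame(
    {"education": ["Bachelors", "Bachelors", "HS-grad"], "age": [39, 50, 38]}
)

indexed = toy.set_index("education")  # returns a new DataFrame; toy is unchanged
is_unique = indexed.index.is_unique   # False: "Bachelors" appears twice
bachelors = indexed.loc["Bachelors"]  # both matching rows come back
restored = indexed.reset_index()      # index becomes a column again

print(is_unique, len(bachelors), list(restored.columns))
```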


#### Conditional selection

We often need to filter data based on its values.

In [66]:
df.hours>30

0         True
1        False
2         True
3         True
4         True
         ...  
32556     True
32557     True
32558     True
32559    False
32560     True
Name: hours, Length: 32561, dtype: bool

The result is a boolean Series indicating whether each value satisfies the condition. The next cell shows how to use it to select rows and get a DataFrame back:

In [68]:
df.loc[df.hours > 30].head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education_level,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hours,native,target
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K


In [72]:
# To select records using more than one criterion:
df.loc[(df.hours > 30) & (df.education_level >= 13)].head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education_level,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hours,native,target
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


In [77]:
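# Returns no rows: the raw values carry a leading space (' Masters'), so the exact comparison never matches
df.loc[(df.hours > 1) & (df.education == 'Masters')].head(5)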

Unnamed: 0,age,workclass,fnlwgt,education,education_level,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hours,native,target


`&` represents the AND operator, while `|` represents the OR operator  
*Note*: the selection criteria can involve different columns.
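
A minimal sketch of combining conditions on a made-up DataFrame (`toy`). Each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons; `~` negates a mask.

```python
import pandas as pd

toy = pd.DataFrame({"hours": [40, 13, 50, 20], "education_level": [13, 9, 9, 14]})

long_hours = toy.hours > 30
high_education = toy.education_level >= 13

both = toy.loc[long_hours & high_education]       # AND: both conditions hold
either = toy.loc[long_hours | high_education]     # OR: at least one holds
neither = toy.loc[~(long_hours | high_education)] # NOT: neither holds

print(len(both), len(either), len(neither))
```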

pandas offers several selection helpers; two useful ones are `isin()` and `notnull()`.

In [83]:
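# Returns no rows for the same reason: the stored strings are ' Masters' and ' Bachelors', with a leading space
df.loc[df.education.isin(['Masters','Bachelors'])].head(5)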

Unnamed: 0,age,workclass,fnlwgt,education,education_level,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hours,native,target


In [81]:
type(df.education[0])

str

In [82]:
type('Bachelors')

str
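
Both sides are `str`, so the types are not the problem. A likely explanation for the empty results is that the stored values start with a space, as the ` ?` label in the group-by output further down suggests. The sketch below reproduces the effect on a made-up Series and removes the whitespace with `str.strip()`; this is one possible fix, not necessarily the one the notebook intended.

```python
import pandas as pd

raw = pd.Series([" Masters", " Bachelors", " HS-grad"])  # values with a leading space

matches_before = raw.isin(["Masters", "Bachelors"]).sum()   # nothing matches
clean = raw.str.strip()                                     # remove surrounding whitespace
matches_after = clean.isin(["Masters", "Bachelors"]).sum()  # now two values match

print(matches_before, matches_after)
```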

In [88]:
# Filter records and draw a random sample
df.loc[df.marital.notnull()].sample(5)

Unnamed: 0,age,workclass,fnlwgt,education,education_level,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hours,native,target
20008,47,Private,386136,Assoc-acdm,12,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,40,United-States,>50K
23972,47,Private,365516,Masters,14,Divorced,Prof-specialty,Unmarried,White,Female,0,0,40,United-States,<=50K
15208,20,Private,153516,Some-college,10,Never-married,Adm-clerical,Own-child,White,Male,0,0,30,United-States,<=50K
18856,37,Private,201259,11th,7,Divorced,Transport-moving,Not-in-family,White,Male,0,0,65,United-States,<=50K
26528,47,Private,456661,7th-8th,4,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,Mexico,<=50K
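
`notnull()` only detects real missing values (`NaN`/`None`). A placeholder such as `?`, which appears in some rows of this dataset, counts as a regular string unless it is converted first. A minimal sketch on a made-up DataFrame (`toy`); `random_state` is passed to `sample` only to make the draw reproducible.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"workclass": ["Private", "?", None, "State-gov"]})

mask_raw = toy.workclass.notnull().tolist()        # "?" still counts as a value

cleaned = toy.replace("?", np.nan)                 # turn the placeholder into NaN
mask_clean = cleaned.workclass.notnull().tolist()  # now "?" is treated as missing

picked = cleaned.loc[cleaned.workclass.notnull()].sample(n=2, random_state=0)
print(mask_raw, mask_clean, len(picked))
```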


#### Sorting by a column

In [90]:
df.sort_values('hours', ascending=True).head()


Unnamed: 0,age,workclass,fnlwgt,education,education_level,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hours,native,target
19750,23,Private,72887,HS-grad,9,Never-married,Craft-repair,Own-child,Asian-Pac-Islander,Male,0,0,1,Vietnam,<=50K
25078,74,Private,260669,10th,6,Divorced,Other-service,Not-in-family,White,Female,0,0,1,United-States,<=50K
11451,27,Private,147951,HS-grad,9,Never-married,Machine-op-inspct,Other-relative,White,Male,0,0,1,United-States,<=50K
8447,67,?,244122,Assoc-voc,11,Widowed,?,Not-in-family,White,Female,0,0,1,United-States,<=50K
32525,81,?,120478,Assoc-voc,11,Divorced,?,Unmarried,White,Female,0,0,1,?,<=50K
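
`sort_values` also accepts a list of columns, with one sort direction per column, which helps break ties like the repeated `hours` values above. A minimal sketch on a made-up DataFrame (`toy`):

```python
import pandas as pd

toy = pd.DataFrame({"age": [23, 74, 27], "hours": [1, 40, 1]})

# Sort by hours ascending, then by age descending within equal hours
ordered = toy.sort_values(["hours", "age"], ascending=[True, False])
print(ordered)
```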


#### Column-wise operations

mean, median, std (standard deviation), sum


In [92]:
# median of the `age` column
df['age'].median()

37.0

In [95]:
# standard deviation of the `age` column
df['age'].std()

13.640432553581341
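
Several of these statistics can be computed in one call with `agg`. A minimal sketch on a made-up Series (`ages`); note that pandas' `std` is the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

ages = pd.Series([20, 30, 40])

stats = ages.agg(["mean", "median", "std", "sum"])  # one result per function name
print(stats)
```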

#### Grouping



In [99]:
df.groupby('native')['education_level'].mean()

native
 ?                             10.598628
 Cambodia                       8.789474
 Canada                        10.652893
 China                         11.120000
 Columbia                       9.372881
 Cuba                           9.600000
 Dominican-Republic             7.114286
 Ecuador                        9.464286
 El-Salvador                    6.839623
 England                       11.011111
 France                        12.241379
 Germany                       10.985401
 Greece                         9.724138
 Guatemala                      6.031250
 Haiti                          8.931818
 Holand-Netherlands            10.000000
 Honduras                       8.692308
 Hong                          10.600000
 Hungary                       10.769231
 India                         12.430000
 Iran                          12.395349
 Ireland                       10.083333
 Italy                          8.849315
 Jamaica                        9.851852
 Japan   
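
A group-by can compute several aggregates at once and then be sorted by one of them. A minimal sketch on a made-up DataFrame (`toy`) with placeholder group labels `A` and `B`:

```python
import pandas as pd

toy = pd.DataFrame(
    {"native": ["A", "A", "B"], "education_level": [10, 12, 9]}
)

summary = (
    toy.groupby("native")["education_level"]
    .agg(["mean", "count"])               # one column per aggregate
    .sort_values("mean", ascending=False) # highest average first
)
print(summary)
```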

## 4. Matplotlib and Seaborn

## 5. Scikit-Learn

## References
+ [pandas documentation](https://pandas.pydata.org/docs/)