<a href="https://colab.research.google.com/github/Josh1313/-Python_Basics-_1-/blob/main/Clase_5_PandaS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python Basic 5

# Pandas

Pandas is an open source Python library for data analysis.
Although Python has long been a good language for data preparation, it was never a strong choice for analysis (people tended to use R, SQL or even Excel for that kind of task) until Pandas arrived.

Pandas offers:
- High performance
- Easy-to-use data structures in tabular form (similar to vectors and data frames in R)
- Excellent tools for data analysis: reshaping, concatenating, aggregating, sorting, slicing, etc.
- Handling of missing data (null values)

## Data structures in Pandas

Pandas introduces two new data structures to Python:
- Series - 1D structures
- DataFrames - 2D structures

Both are built on top of NumPy, which makes them fast.

![title](https://storage.googleapis.com/lds-media/images/series-and-dataframe.width-1200.png)

|Data structure|Dimensions|Description|
|---|---|:---|
|Series|1|Labeled, homogeneous 1D array, immutable in size|
|DataFrame|2|Labeled 2D tabular structure, mutable in size, with potentially heterogeneous columns|
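
As a minimal sketch of the difference between the two structures, the cell below builds one of each. The names `edades` and `alumnos` and their values are made up for illustration and do not appear elsewhere in this notebook.

```python
import pandas as pd

# A Series: one labeled column of values sharing a single dtype
edades = pd.Series([25, 32, 41], index=["Elena", "Marcos", "Sofía"])

# A DataFrame: a labeled 2D table whose columns may have different dtypes
alumnos = pd.DataFrame(
    {"edad": [25, 32, 41], "ciudad": ["Madrid", "Valencia", "Boston"]},
    index=["Elena", "Marcos", "Sofía"],
)

print(edades.ndim, alumnos.ndim)  # number of dimensions of each structure
print(alumnos.dtypes)             # one dtype per column

# Selecting a single column of a DataFrame returns a Series
print(type(alumnos["edad"]))
```

Each column of the DataFrame behaves like a Series that shares the same index as the other columns.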

# Series in Pandas

**Definition**
A Series is a 1D array that can hold any type of data (integers, strings, floats, Python objects, etc.). It is roughly equivalent to a column in an Excel sheet.

The labels of the elements are called the index, and every element of a Series has an index label assigned to it.

By default, each element gets a numeric index label, from 0 to N-1.

## Creating Series: pd.Series()

In [1]:
import pandas as pd

In [2]:
nombres = ["Elena", "Marcos", " Judith", "Sofía", "Estefanía"]

In [3]:
Series = pd.Series(nombres)

In [4]:
Series[2]

' Judith'

But **we can always assign a list of labels to the values of our Series**

In [5]:
lista_estudiantes = ["Estudiante 1", "Estudiante 2","Estudiante 3","Estudiante 4","Estudiante 5"]

In [6]:
pd.Series(nombres, index = lista_estudiantes)

Estudiante 1        Elena
Estudiante 2       Marcos
Estudiante 3       Judith
Estudiante 4        Sofía
Estudiante 5    Estefanía
dtype: object

In [7]:
Series

0        Elena
1       Marcos
2       Judith
3        Sofía
4    Estefanía
dtype: object

In [8]:
Series[0]

'Elena'

### Creating a Series from an *ndarray*

In [9]:
import numpy as np

In [10]:
array_5_items = np.random.rand(5)
array_5_items

array([0.56908486, 0.39246089, 0.09399233, 0.44755   , 0.77452281])

In [11]:
pd.Series(array_5_items)

0    0.569085
1    0.392461
2    0.093992
3    0.447550
4    0.774523
dtype: float64

#### Specifying labels

In [12]:
lista_5_indices = ['e','e','e','e','e']

print(lista_5_indices)

['e', 'e', 'e', 'e', 'e']


In [13]:
pd.Series(array_5_items,index = lista_5_indices)

e    0.569085
e    0.392461
e    0.093992
e    0.447550
e    0.774523
dtype: float64

In [14]:
lista_5_indices = ['A','A','e','e','e']

In [15]:
print(lista_5_indices)

['A', 'A', 'e', 'e', 'e']


In [16]:
pd.Series(array_5_items,index = lista_5_indices)

A    0.569085
A    0.392461
e    0.093992
e    0.447550
e    0.774523
dtype: float64

### Creating a Series from a dictionary

In [17]:
mi_diccionario = {'a':1, 'b':2, 'c':3,'d':4 }

In [18]:
pd.Series(mi_diccionario)

a    1
b    2
c    3
d    4
dtype: int64

In [19]:
mi_diccionario = {'a':1, 'b':2, 'c':3}

In [20]:
pd.Series(mi_diccionario)

a    1
b    2
c    3
dtype: int64

## Series basics

### Index and values

In [21]:
d = {'Barcelona': 1000, 'Nueva York': 1300, 'Madrid': 900, 'San Francisco': 1100,
     'Valencia': 450, 'Boston': None}
ciudades = pd.Series(d)
ciudades

Barcelona        1000.0
Nueva York       1300.0
Madrid            900.0
San Francisco    1100.0
Valencia          450.0
Boston              NaN
dtype: float64
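
In the output above, the `None` for Boston became `NaN` and the whole Series was stored as `float64`. The following sketch shows a few common ways of working with such missing values; `ciudades_demo` is a hypothetical, shortened copy of the dictionary so that `ciudades` itself is left untouched.

```python
import pandas as pd

d_demo = {"Barcelona": 1000, "Valencia": 450, "Boston": None}
ciudades_demo = pd.Series(d_demo)

print(ciudades_demo.isna())     # Boolean mask marking the missing entries
print(ciudades_demo.dropna())   # copy without the missing entries
print(ciudades_demo.fillna(0))  # copy with missing entries replaced by 0
print(ciudades_demo.mean())     # reductions skip NaN by default
```

None of these calls modifies `ciudades_demo`; each returns a new object.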

In [22]:
ciudades.values

array([1000., 1300.,  900., 1100.,  450.,   nan])

In [23]:
ciudades.index

Index(['Barcelona', 'Nueva York', 'Madrid', 'San Francisco', 'Valencia',
       'Boston'],
      dtype='object')

In [24]:
ciudades.index = ['a', 'b', 'c', 'd', 'e', 'f']

In [25]:
ciudades

a    1000.0
b    1300.0
c     900.0
d    1100.0
e     450.0
f       NaN
dtype: float64

##### Exercise
- Create a Series from the dictionary
- Get the index of the Series
- Get the values of the Series
- Rename the labels to the letters a to f

In [26]:
d = {'Barcelona': 1000, 'Nueva York': 1300, 'Madrid': 900, 'San Francisco': 1100,
     'Valencia': 450, 'Boston': None}

In [27]:
d = {'Barcelona': 1000, 'Nueva York': 1300, 'Madrid': 900, 'San Francisco': 1100, 'Valencia': 450, 'Boston': None}
s = pd.Series(d)


In [28]:
print(s.index)


Index(['Barcelona', 'Nueva York', 'Madrid', 'San Francisco', 'Valencia',
       'Boston'],
      dtype='object')


In [29]:
print(s.values)


[1000. 1300.  900. 1100.  450.   nan]


In [30]:
# Rename all six labels, including Boston
s = s.rename({'Barcelona': 'a', 'Nueva York': 'b', 'Madrid': 'c', 'San Francisco': 'd', 'Valencia': 'e', 'Boston': 'f'})


### Accessing elements

In [31]:
import numpy as np
import pandas as pd

In [32]:
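# Note: the key 'Nueva York' appears twice, so the later value (1300) wins
# and the resulting Series has five elements
d = {'Nueva York': 1000, 'Nueva York': 1300, 'Madrid': 900, 'San Francisco': 1100,
     'Valencia': 450, 'Boston': None}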

In [33]:
mi_serie = pd.Series(data = d)

In [34]:
mi_serie

Nueva York       1300.0
Madrid            900.0
San Francisco    1100.0
Valencia          450.0
Boston              NaN
dtype: float64

In [35]:
mi_serie['Madrid']

900.0

In [36]:
# Positional access: .iloc avoids the deprecated integer lookup on a label index
mi_serie.iloc[1]

900.0
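
The two lookups above reach the same element, once by label and once by position. A minimal sketch of the explicit accessors `.loc` (labels) and `.iloc` (positions), using a hypothetical Series `precios` so the example is self-contained:

```python
import pandas as pd

precios = pd.Series(
    [1300.0, 900.0, 1100.0],
    index=["Nueva York", "Madrid", "San Francisco"],
)

print(precios.loc["Madrid"])  # access by label
print(precios.iloc[1])        # access by integer position (0-based)
print(precios.iloc[-1])       # negative positions count from the end
```

Using the explicit accessors keeps the intent clear when the index labels are themselves integers.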

### Changing values

In [37]:
d = {'Barcelona': 1000, 'Nueva York': 1300, 'Madrid': 900, 'San Francisco': 1100,
     'Valencia': 450, 'Boston': None}
ciudades = pd.Series(d)
ciudades

Barcelona        1000.0
Nueva York       1300.0
Madrid            900.0
San Francisco    1100.0
Valencia          450.0
Boston              NaN
dtype: float64

In [38]:
ciudades.update({'Barcelona': 2000, 'Nueva York': 2600, 'Madrid': 1800, 'San Francisco': 2200, 'Valencia': 900, 'Boston': 800})

In [39]:
ciudades

Barcelona        2000.0
Nueva York       2600.0
Madrid           1800.0
San Francisco    2200.0
Valencia          900.0
Boston            800.0
dtype: float64

In [40]:
ciudades['Barcelona'] = 3000

### Checking whether an element exists

In [41]:
ciudades

Barcelona        3000.0
Nueva York       2600.0
Madrid           1800.0
San Francisco    2200.0
Valencia          900.0
Boston            800.0
dtype: float64

In [42]:
if 'Barcelona' in ciudades:
    print("Barcelona exists in the ciudades Series")
else:
    print("Barcelona does not exist in the ciudades Series")


Barcelona exists in the ciudades Series


In [43]:
'Barcelona' in ciudades

True
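
Note that `in` tests the index labels of a Series, not its values. A short sketch of the difference, with a hypothetical two-element Series `ciudades_demo`:

```python
import pandas as pd

ciudades_demo = pd.Series({"Barcelona": 3000.0, "Madrid": 1800.0})

print("Barcelona" in ciudades_demo)        # True: it is an index label
print(3000.0 in ciudades_demo)             # False: 3000.0 is a value, not a label
print(3000.0 in ciudades_demo.values)      # True: searches the underlying array
print(ciudades_demo.isin([3000.0]).any())  # True: element-wise membership test
```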

In [44]:
# the same check works for any other city

#### Vectorized functions

In [45]:
import numpy as np
import pandas as pd

In [46]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [47]:
a = pd.Series(np.random.randn(5), index=['h', 'b', 'c', 'd', 'e'])


In [48]:
suma = s + a
print(suma)

a         NaN
b    1.238351
c   -0.115055
d    1.590101
e    0.888797
h         NaN
dtype: float64


In [49]:
s + s

a    2.864163
b    3.513170
c   -0.532274
d    2.378393
e    3.016752
dtype: float64
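
The `NaN` entries in the sum of `s` and `a` appear because arithmetic between Series aligns on labels, and labels present on only one side have no partner. A sketch with fixed values (the names `s1` and `s2` are hypothetical) that also shows the `fill_value` parameter of `add()`:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "h"])

# The + operator aligns on labels; labels found in only one Series give NaN
print(s1 + s2)

# add() with fill_value treats a label missing from one side as 0
print(s1.add(s2, fill_value=0))
```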

## Slicing a Series

In [50]:
import pandas as pd

In [51]:
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

In [52]:
slice_s = s[1:4]
print(slice_s)

b    2
c    3
d    4
dtype: int64


In [53]:
mi_serie[:2]

Nueva York    1300.0
Madrid         900.0
dtype: float64

### By label

In [54]:
import numpy as np
import pandas as pd

In [55]:
array = [1.0,2.0,3.0, 4.0]
index = ['a', 'b', 'c', 'd']
mi_serie = pd.Series(data = array, index = index)
mi_serie

a    1.0
b    2.0
c    3.0
d    4.0
dtype: float64

In [56]:
elem_a = mi_serie['a']
print(elem_a)

1.0


In [57]:
elem_ac = mi_serie[['a', 'c']]
print(elem_ac)


a    1.0
c    3.0
dtype: float64


In [58]:
elem_b = mi_serie[mi_serie.index.str.startswith('b')]
print(elem_b)


b    2.0
dtype: float64


In [59]:
mi_serie[:2]

a    1.0
b    2.0
dtype: float64
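
Slices behave differently depending on whether they use positions or labels. A minimal sketch, assuming a small Series `letras` like the one above:

```python
import pandas as pd

letras = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])

print(letras.iloc[0:2])     # positions 0 and 1; the stop position is excluded
print(letras.loc["a":"c"])  # labels a, b and c; the stop label is included
```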

### By position

In [60]:
s = pd.Series(np.random.randn(10), ['a','b','c','d','e','f','g','h','i','j'])
s

a   -2.127067
b    1.969206
c   -1.728402
d    1.358355
e   -1.062877
f    0.755361
g   -0.029281
h   -0.647350
i   -0.703604
j    0.731319
dtype: float64

In [61]:
s.iloc[1]

1.969206283646305

In [62]:
elems_135 = s.iloc[[1, 3, 5]]
print(elems_135)


b    1.969206
d    1.358355
f    0.755361
dtype: float64


#### By Boolean conditions on the values

In [63]:
s

a   -2.127067
b    1.969206
c   -1.728402
d    1.358355
e   -1.062877
f    0.755361
g   -0.029281
h   -0.647350
i   -0.703604
j    0.731319
dtype: float64

In [64]:
s_gt0 = s[s > 0]


In [65]:
print(s_gt0)

b    1.969206
d    1.358355
f    0.755361
j    0.731319
dtype: float64


In [66]:
s[s < s.mean()]

a   -2.127067
c   -1.728402
e   -1.062877
h   -0.647350
i   -0.703604
dtype: float64
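
Several conditions can be combined into one mask. The sketch below uses a hypothetical Series `valores` with fixed values so the result is reproducible:

```python
import pandas as pd

valores = pd.Series([-2.0, 1.5, -0.5, 3.0, 0.2], index=list("abcde"))

# Combine masks with & (and), | (or) and ~ (not);
# each comparison needs its own parentheses
entre = valores[(valores > -1) & (valores < 2)]
extremos = valores[(valores < -1) | (valores > 2)]
no_positivos = valores[~(valores > 0)]

print(entre)
print(extremos)
print(no_positivos)
```

Python's `and` and `or` keywords do not work element-wise on a Series, which is why the bitwise operators are used here.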

# SUMMARY of Series methods

## Mathematical methods

|Method|Description|
|:---|:---|
|add()|Adds a Series or list-like object of the same length to the caller Series, element-wise|
|sub()|Subtracts a Series or list-like object of the same length from the caller Series, element-wise|
|mul()|Multiplies the caller Series by a Series or list-like object of the same length, element-wise|
|div()|Divides the caller Series by a Series or list-like object of the same length, element-wise|
|sum()|Returns the sum of the values|
|prod()|Returns the product of the values|
|mean()|Returns the mean of the values|
|pow()|Raises each element of the caller Series to the power given by the corresponding element of the passed Series|
|abs()|Returns the absolute numeric value of each element in a Series or DataFrame|
|cov()|Computes the covariance between two Series|
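
A short sketch of the methods in the table, applied to two hypothetical Series `x` and `y` with the same default index:

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0])
y = pd.Series([2.0, 4.0, 8.0])

print(x.add(y))    # element-wise x + y
print(y.sub(x))    # element-wise y - x
print(x.mul(y))    # element-wise x * y
print(y.div(x))    # element-wise y / x (the caller is the numerator)
print(x.pow(2))    # each element of x squared
print((-x).abs())  # absolute values
print(x.sum(), x.prod(), x.mean())
print(x.cov(y))    # sample covariance between x and y
```

The method forms are equivalent to the `+`, `-`, `*`, `/` and `**` operators, but also accept extra parameters such as `fill_value`.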

## Exploratory methods

|Function|Description|
|---|---|
|combine_first()|Combines two Series, filling null values in the caller with values from the passed Series at matching labels|
|count()|Returns the number of non-NA/null observations in the Series|
|size|Attribute (no parentheses) that returns the number of elements in the underlying data|
|name|Attribute (no parentheses) that holds the name of the Series, i.e. of the column|
|is_unique|Attribute (no parentheses) that is True if all values in the Series are unique|
|idxmax()|Returns the index label of the highest value in the Series|
|idxmin()|Returns the index label of the lowest value in the Series|
|sort_values()|Sorts the Series by its values in ascending or descending order|
|sort_index()|Sorts the Series by its index instead of its values|
|head()|Returns a specified number of rows from the beginning of the Series as a new Series|
|tail()|Returns a specified number of rows from the end of the Series as a new Series|
|le()|Compares each element of the caller with the passed Series; returns True where the caller's element is less than or equal to the passed element|
|ne()|Compares each element of the caller with the passed Series; returns True where the elements are not equal|
|ge()|Compares each element of the caller with the passed Series; returns True where the caller's element is greater than or equal to the passed element|
|eq()|Compares each element of the caller with the passed Series; returns True where the elements are equal|
|gt()|Compares each element of the caller with the passed Series; returns True where the caller's element is greater than the passed element|
|lt()|Compares each element of the caller with the passed Series; returns True where the caller's element is less than the passed element|
|clip()|Limits values to a given lower and upper bound|
|clip_lower()|Limited values to a lower bound; removed in pandas 1.0, use `clip(lower=...)` instead|
|clip_upper()|Limited values to an upper bound; removed in pandas 1.0, use `clip(upper=...)` instead|
|astype()|Changes the data type of a Series|
|tolist()|Converts a Series to a list|
|get()|Returns the value for a given label, or a default if the label is missing; an alternative to the bracket syntax|
|unique()|Returns the unique values of the Series|
|nunique()|Returns the number of unique values|
|value_counts()|Counts how many times each unique value occurs in the Series|
|factorize()|Encodes the values as integer codes by identifying the distinct values|
|map()|Maps each value of the Series to another value using a dictionary, a Series or a function|
|between()|Returns a Boolean Series indicating which values lie between the first and second argument|
|apply()|Calls a Python function on every value of the Series. Useful for custom operations that are not included in pandas or NumPy|

Let's practice some exploratory methods that are widely used in data manipulation.

In [67]:
import numpy as np
import pandas as pd

In [68]:
Serie = pd.Series(np.random.randint(0, 10, 20))

In [69]:
Serie

0     3
1     1
2     5
3     5
4     4
5     1
6     2
7     4
8     7
9     3
10    7
11    2
12    2
13    7
14    2
15    1
16    6
17    0
18    4
19    9
dtype: int64

In [70]:
Serie.head()

0    3
1    1
2    5
3    5
4    4
dtype: int64

In [71]:
Serie.head(2)

0    3
1    1
dtype: int64

In [72]:
Serie.tail()

15    1
16    6
17    0
18    4
19    9
dtype: int64

In [73]:
Serie.unique()

array([3, 1, 5, 4, 2, 7, 6, 0, 9])

In [74]:
Serie.nunique()

9

In [75]:
Serie.value_counts()

2    4
1    3
4    3
7    3
3    2
5    2
6    1
0    1
9    1
dtype: int64

In [76]:
Serie.value_counts(normalize = True)

2    0.20
1    0.15
4    0.15
7    0.15
3    0.10
5    0.10
6    0.05
0    0.05
9    0.05
dtype: float64

In [77]:
Serie.value_counts(normalize = True, sort = True, ascending = True)

6    0.05
0    0.05
9    0.05
3    0.10
5    0.10
1    0.15
4    0.15
7    0.15
2    0.20
dtype: float64

In [78]:
Serie.value_counts(normalize = True, sort = True, ascending = True, bins = 4)

(4.5, 6.75]                      0.15
(6.75, 9.0]                      0.20
(2.25, 4.5]                      0.25
(-0.009999999999999998, 2.25]    0.40
dtype: float64

In [79]:
Serie.value_counts(normalize = True, sort = True, ascending = True, bins = 4) * 100

(4.5, 6.75]                      15.0
(6.75, 9.0]                      20.0
(2.25, 4.5]                      25.0
(-0.009999999999999998, 2.25]    40.0
dtype: float64

In [80]:
Serie.astype(float)

0     3.0
1     1.0
2     5.0
3     5.0
4     4.0
5     1.0
6     2.0
7     4.0
8     7.0
9     3.0
10    7.0
11    2.0
12    2.0
13    7.0
14    2.0
15    1.0
16    6.0
17    0.0
18    4.0
19    9.0
dtype: float64

In [81]:
Serie = Serie.astype(str)

In [82]:
Serie

0     3
1     1
2     5
3     5
4     4
5     1
6     2
7     4
8     7
9     3
10    7
11    2
12    2
13    7
14    2
15    1
16    6
17    0
18    4
19    9
dtype: object
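
The cells above cover `head()`, `tail()`, `unique()`, `nunique()`, `value_counts()` and `astype()`. A few other methods from the summary table can be tried in the same way; the sketch below uses a hypothetical Series `notas` with fixed values so the results do not depend on random numbers.

```python
import pandas as pd

notas = pd.Series([3, 9, 5, 0, 7], index=["a", "b", "c", "d", "e"])

print(notas.idxmax(), notas.idxmin())              # labels of the largest and smallest values
print(notas.sort_values(ascending=False).head(2))  # the two largest values
print(notas.between(3, 7))                         # True where 3 <= value <= 7
print(notas.clip(lower=2, upper=8))                # values limited to the range [2, 8]
print(notas.map({0: "cero", 9: "nueve"}))          # values without a mapping become NaN
print(notas.apply(lambda v: v * 10))               # call a Python function on every value
print(notas.get("z", default=-1))                  # default returned for a missing label
```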