# Data Analysis with Pandas

<div style="text-align: right">by <a href="https://www.linkedin.com/in/angel-astudillo-aguilar" target="_blank">Angel Xavier Astudillo Aguilar</a></div>




## Recap
1. Overview and introduction to the Python environment
2. Introduction to Python
3. Variables and the most common data types
4. Arithmetic and string operators
5. Control flow: loops
6. Control flow: conditionals
7. Functions
8. Libraries in Python
9. Pandas for data analysis
  - Dictionaries
  - Intro
  - Data structures
    - Series
    - DataFrames
  - Loading data (importing datasets)
  - Exploring a DataFrame
    - Getting summary statistics

## Today we will cover
10. Data analysis
  - CRISP-DM
  - Exploratory data analysis



### Answering questions:
Destructuring assignment (usually called unpacking in Python)

In [None]:
a, b = 3, 4

In [None]:
a

In [None]:
b

In [None]:
c, d = [1,2]

In [None]:
c

In [None]:
d

In [None]:
[e, f] = [5,6]

In [None]:
e

In [None]:
f

In [None]:
x, y = {"nombre": "Angel", "apellido": "Astudillo"}

In [None]:
x

In [None]:
y
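A detail worth spelling out about the last example: unpacking a dictionary iterates over its keys, not its values. The sketch below uses a hypothetical variable name (`persona`) with the same data to show the three common variants.

```python
# Hypothetical variable name; the data mirrors the dictionary used above.
persona = {"nombre": "Angel", "apellido": "Astudillo"}

# Unpacking a dict iterates over its keys.
k1, k2 = persona
print(k1, k2)  # nombre apellido

# Use .values() to unpack the values instead.
v1, v2 = persona.values()
print(v1, v2)  # Angel Astudillo

# Use .items() to unpack (key, value) pairs.
(first_key, first_value), (second_key, second_value) = persona.items()
print(first_key, first_value)  # nombre Angel
```

The number of names on the left must match the number of elements on the right, otherwise Python raises a `ValueError`.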

## Data analysis with Pandas

### CRISP-DM
<a href="https://es.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining" target="_blank">Cross Industry Standard Process for Data Mining</a>


<img width=450 src="https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png">

<img width=600 src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQERxc80IzFxaLOZjg3ruBw3kqe1SzhwHqvl6XH_Qbd3HgPF5uq">

#### Exploratory data analysis

<img width=600 src="https://www.researchgate.net/publication/329930775/figure/fig3/AS:873046667710469@1585161954284/The-fundamental-steps-of-the-exploratory-data-analysis-process.png">

<a href="https://www.ibm.com/topics/exploratory-data-analysis" target="_blank">Resource 1</a>

<a href="https://medium.com/@oluwabukunmige/pipeline-for-exploratory-data-analysis-and-data-cleaning-6adce7ac0594" target="_blank">Resource 2</a>



#### Load data

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("./sample_data/titanic.csv")

In [None]:
cars = pd.read_excel("./sample_data/cars.xlsx")

#### Identify the attributes

In [None]:
cars.columns

In [None]:
cars.dtypes

In [None]:
cars.info()

#### Visual inspection

In [None]:
cars.head()

#### Get summary statistics

In [None]:
cars.describe()

### Exploring by column (univariate analysis)

In [None]:
datos_list = ["Angel", "Astudillo", 33]
datos_list

In [None]:
datos_list[0]

In [None]:
datos_list[2]  # valid indexes are 0-2; datos_list[3] would raise IndexError

In [None]:
datos_dict = {"nombre":"Angel", "apellido":"Astudillo", "edad":33}
datos_dict

In [None]:
datos_dict["nombre"]

In [None]:
datos_dict["edad"]

In [None]:
cars["acceleration"]

In [None]:
cars.acceleration

In [None]:
cars.acceleration.describe()

In [None]:
cars.acceleration.describe().round(2)

In [None]:
cars.acceleration.hist(color="red", bins=3)

In [None]:
type(cars)

In [None]:
cars.head()

In [None]:
cars["model year"].head()

In [None]:
cars.mpg

In [None]:
cars.mpg.max()

In [None]:
cars.mpg.min()

In [None]:
cars.mpg.mean()

In [None]:
cars.mpg.quantile(0.5)

In [None]:
cars.mpg.quantile(0.9)
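To connect the calls above: the 0.5 quantile is the median, and `quantile` also accepts a list to get several cut points at once. A minimal sketch on a small hypothetical Series (the values are made up, not taken from `cars.xlsx`):

```python
import pandas as pd

# Hypothetical fuel-economy values, chosen so the results are easy to check by hand.
mpg = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

median = mpg.median()
q50 = mpg.quantile(0.5)  # the 0.5 quantile is the median

# Passing a list returns a Series indexed by the requested quantiles.
q25, q75 = mpg.quantile([0.25, 0.75])
iqr = q75 - q25  # interquartile range, a simple measure of spread

print(median, q50, q25, q75, iqr)
```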

In [None]:
cars.head()

In [None]:
cars.weight.sum() / 1000

In [None]:
cars.cylinders.unique()

In [None]:
cars.cylinders.nunique()

In [None]:
cars.cylinders.unique()

In [None]:
cars.cylinders.value_counts()

In [None]:
cars.cylinders.value_counts().sort_index()
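Counts are often easier to compare as proportions. As a small sketch on hypothetical data (the `cylinders` Series below is invented for illustration), `value_counts(normalize=True)` returns the share of each value instead of the raw count:

```python
import pandas as pd

# Hypothetical cylinder counts for six cars.
cylinders = pd.Series([4, 4, 8, 6, 4, 8])

# Absolute frequencies, ordered by the value itself rather than by frequency.
counts = cylinders.value_counts().sort_index()

# Relative frequencies: each entry is count / total, so they add up to 1.
shares = cylinders.value_counts(normalize=True).sort_index()

print(counts)
print(shares.round(3))
```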

In [None]:
cars.head()

In [None]:
cars.shape

In [None]:
cars.nunique()

In [None]:
cars["car name"].value_counts()

In [None]:
cars.head()

In [None]:
cars.sort_values("acceleration")

In [None]:
cars.sort_values("acceleration", ascending=False).head()

In [None]:
cars.sort_values(["cylinders", "acceleration"]).head()

### Bivariate analysis
Counting how often each pair of values occurs across two variables

In [None]:
cars.head()

In [None]:
cars.cylinders.unique()

In [None]:
cars.origin.unique()

In [None]:
pd.crosstab(cars.cylinders, cars.origin)

In [None]:
pd.crosstab(df.Pclass, df.Survived).round(3)

In [None]:
pd.crosstab(df.Pclass, df.Survived, normalize="index").round(3)

In [None]:
pd.crosstab(df.Pclass, df.Survived, normalize="columns").round(3)
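The difference between `normalize="index"` and `normalize="columns"` is easier to see on a table small enough to check by hand. The sketch below uses a hypothetical eight-row frame (`toy`) with the same column names as the Titanic data; the values are made up.

```python
import pandas as pd

# Hypothetical passengers: class and survival flag (invented values).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# Raw counts for each (class, survived) pair.
counts = pd.crosstab(toy.Pclass, toy.Survived)

# normalize="index": each row sums to 1 (share of outcomes within a class).
by_row = pd.crosstab(toy.Pclass, toy.Survived, normalize="index")

# normalize="columns": each column sums to 1 (share of classes within an outcome).
by_col = pd.crosstab(toy.Pclass, toy.Survived, normalize="columns")

print(counts)
print(by_row)
print(by_col)
```

Reading `by_row` answers "within each class, what fraction survived?", while `by_col` answers "among survivors, what fraction came from each class?".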

### Selecting within the DataFrame

In [None]:
cars.head()

#### by column name

In [None]:
cars[['mpg', 'cylinders', 'displacement']].head()

#### by condition

In [None]:
cars.head()

In [None]:
cars.mpg > 16

In [None]:
(cars.mpg > 16).sum()

In [None]:
cars[cars.mpg > 16]

In [None]:
cars[cars.acceleration > 23]

In [None]:
cars.head()

In [None]:
cars["car name"]

In [None]:
cars["car name"].str.contains("ford")

In [None]:
cars[cars["car name"].str.contains("ford")].head()

`~` is logical negation: it inverts a boolean mask

In [None]:
cars.head()

In [None]:
~cars["car name"].str.contains("ford")

In [None]:
cars[~cars["car name"].str.contains("ford")].shape

In [None]:
cars[~cars["car name"].str.contains("ford")].head()

`&` is the element-wise "and"  
`|` is the element-wise "or"

In [None]:
cars.head()

In [None]:
(cars.mpg > 15) & (cars.weight > 3500)

In [None]:
cars[(cars.mpg > 15) & (cars.weight > 3500)].head()

In [None]:
cars[(cars.mpg > 18) | (cars.weight > 3500)].head()

In [None]:
cars[cars.cylinders == 3]

In [None]:
cars.cylinders.isin([3, 5])

In [None]:
cars[cars.cylinders.isin([3, 5])]

In [None]:
cars.acceleration.mean()

In [None]:
cars[cars.acceleration > cars.acceleration.mean()][["mpg", "acceleration", "car name"]]
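The filters in this section can be summarized on a tiny frame whose results are easy to verify. This is only a sketch: the `toy` frame and its values are hypothetical, and `query()` and the `na=False` argument are shown as optional alternatives, not as something the cells above rely on.

```python
import pandas as pd

# Hypothetical cars, using the same column names as the dataset.
toy = pd.DataFrame({
    "mpg": [14.0, 18.0, 25.0, 31.0],
    "weight": [4200, 3600, 2500, 2000],
    "car name": ["ford torino", "amc hornet", "ford pinto", "honda civic"],
})

# Parentheses are required: & and | bind more tightly than comparisons such as >.
heavy_and_efficient = toy[(toy.mpg > 15) & (toy.weight > 3500)]
efficient_or_heavy = toy[(toy.mpg > 18) | (toy.weight > 3500)]

# na=False treats missing names as "no match", so the mask stays purely boolean.
is_ford = toy["car name"].str.contains("ford", na=False)
not_ford = toy[~is_ford]

# The same "and" filter written with query(), which reads closer to plain text.
same_filter = toy.query("mpg > 15 and weight > 3500")

print(heavy_and_efficient)
print(not_ford)
```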

#### by index

In [None]:
cars.head()

In [None]:
cars = cars.sort_values("car name")

In [None]:
cars.head()

In [None]:
cars.loc[2]

In [None]:
cars.iloc[2]

In [None]:
cars.iloc[100:105]
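After sorting, `loc` and `iloc` stop returning the same row, because `loc` looks up the index label while `iloc` counts positions in the current order. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical frame with the default index labels 0, 1, 2.
toy = pd.DataFrame({"car name": ["toyota corolla", "amc hornet", "ford pinto"]})

# Sorting reorders the rows but keeps each row's original index label.
sorted_toy = toy.sort_values("car name")

by_label = sorted_toy.loc[2]      # the row whose label is 2
by_position = sorted_toy.iloc[2]  # the third row in the current order

# reset_index(drop=True) renumbers the rows so labels match positions again.
renumbered = sorted_toy.reset_index(drop=True)

print(by_label["car name"])     # ford pinto
print(by_position["car name"])  # toyota corolla
```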

### Creating columns

In [None]:
cars.head()

In [None]:
cars["is_fast"] = cars.acceleration > 22

In [None]:
cars.head()

In [None]:
cars["miles_per_liter"] = cars["mpg"] / 3.785  # assuming US gallons: 1 gal ≈ 3.785 L (an imperial gallon is ≈ 4.546 L)

In [None]:
cars.head()

In [None]:
cars[cars.is_fast]

### Data preprocessing


### Handling missing values

In [None]:
cars.info()

In [None]:
cars[cars.displacement.isna()]

In [None]:
cars.isna().any()

In [None]:
cars.isna().any(axis=1)

In [None]:
cars[cars.isna().any(axis=1)]

What do we do? We can fill them in

#### fill with a fixed value

In [None]:
cars = cars.sort_index()

In [None]:
cars.head()

In [None]:
cars.fillna(0).head()

#### fill with the mean

In [None]:
cars.fillna(cars.mean(numeric_only=True)).head()  # numeric_only avoids errors on text columns such as "car name"
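Filling with the mean only makes sense for numeric columns. The sketch below, on a hypothetical frame with invented values, shows that passing the per-column means to `fillna` fills each numeric column with its own mean and leaves text columns untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in both numeric and text columns.
toy = pd.DataFrame({
    "displacement": [100.0, np.nan, 300.0],
    "horsepower": [90.0, 110.0, np.nan],
    "car name": ["a", None, "c"],
})

# One mean per numeric column; the text column is skipped.
numeric_means = toy.mean(numeric_only=True)

# fillna with a Series matches its labels against the column names.
filled = toy.fillna(numeric_means)

print(numeric_means)
print(filled)
```

The missing car name stays missing, which is usually what we want: there is no meaningful "average" name.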

#### drop the column

A common rule of thumb is to consider this when roughly 70% or more of the column's values are missing

In [None]:
cars.head()

In [None]:
cars.drop("displacement", axis=1).head()
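To apply a missing-value threshold we first need the share of missing values per column. This is a sketch on hypothetical data; the 0.7 cutoff is the rule of thumb mentioned in this section, not a fixed rule, and the name `threshold` is illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: displacement is 75% missing, weight is 25% missing.
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 16.0, 17.0],
    "displacement": [np.nan, np.nan, np.nan, 302.0],
    "weight": [3504, np.nan, 3436, 3449],
})

# isna() gives True/False per cell; the mean of booleans is the share of True.
missing_share = toy.isna().mean()

threshold = 0.7  # assumed cutoff, adjust to the problem at hand
to_drop = missing_share[missing_share >= threshold].index

# drop returns a new frame; toy itself is left unchanged.
reduced = toy.drop(columns=to_drop)

print(missing_share)
print(list(reduced.columns))
```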

#### drop rows

In [None]:
cars[cars.displacement.notna()].head()

In [None]:
cars[~cars.isna().any(axis=1)].shape

In [None]:
cars_limpios = cars[~cars.isna().any(axis=1)].copy()  # copy so later column assignments do not trigger SettingWithCopyWarning

In [None]:
cars_limpios.info()

Converting data to a specific data type

In [None]:
cars_limpios["model year"].value_counts()

In [None]:
cars_limpios["model year"].astype(int)

In [None]:
cars_limpios["model year"] = cars_limpios["model year"].astype(int)

In [None]:
cars_limpios.info()
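The two steps above (keep only complete rows, then cast a column) can be sketched on a hypothetical frame. The example also shows why the cast has to come after the cleaning: a float column that still contains `NaN` cannot be converted to `int`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; "model year" is float only because of the missing value.
toy = pd.DataFrame({
    "model year": [70.0, 71.0, np.nan, 72.0],
    "mpg": [18.0, np.nan, 16.0, 17.0],
})

# Keep rows without any missing value; .copy() gives an independent frame.
complete_rows = toy[~toy.isna().any(axis=1)].copy()

# The boolean mask above selects the same rows as dropna().
same_as_dropna = complete_rows.equals(toy.dropna())

# With no NaN left, the float column can be cast to int safely.
complete_rows["model year"] = complete_rows["model year"].astype(int)

print(complete_rows.dtypes)
```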

### GroupBy

In [None]:
cars.head()

In [None]:
cars.groupby('cylinders').acceleration.mean()

In [None]:
cars.groupby('cylinders').agg({'acceleration': 'std', 'weight': 'mean'}).head()
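When aggregating several columns, named aggregation gives each result column an explicit name. This is a sketch on hypothetical data; the output names (`acceleration_std`, `weight_mean`, `n_cars`) are illustrative choices.

```python
import pandas as pd

# Hypothetical cars: two with 4 cylinders, two with 8.
toy = pd.DataFrame({
    "cylinders": [4, 4, 8, 8],
    "acceleration": [15.0, 17.0, 11.0, 13.0],
    "weight": [2000, 2400, 4000, 4400],
})

# Each keyword is an output column: name=(input column, aggregation).
summary = toy.groupby("cylinders").agg(
    acceleration_std=("acceleration", "std"),
    weight_mean=("weight", "mean"),
    n_cars=("weight", "count"),
)

print(summary)
```

Note that `std` is the sample standard deviation (it divides by n - 1), so a group with a single row would give `NaN`.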

### Exporting data (creating data subsets)

In [None]:
cars_limpios.to_csv('./sample_data/cars1.csv', index=False)

In [None]:
cars_limpios.to_csv('./sample_data/cars2.csv', index=True)

In [None]:
cars_limpios.to_json('./sample_data/cars.json', orient='records')

In [None]:
cars_limpios.to_excel('./sample_data/cars1.xlsx', index=False)  # new file name so the source cars.xlsx is not overwritten
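The effect of `index=False` versus `index=True` is easiest to see by looking at the text that gets written. The sketch below avoids touching the disk: calling `to_csv` without a path returns the CSV as a string, and `io.StringIO` lets `read_csv` read it back. The frame and its index labels are hypothetical.

```python
import io

import pandas as pd

# Hypothetical frame with non-default index labels to make the index visible.
toy = pd.DataFrame({"mpg": [18.0, 15.0], "cylinders": [8, 8]}, index=[10, 20])

# Without a path, to_csv returns the CSV content as a string.
without_index = toy.to_csv(index=False)
with_index = toy.to_csv(index=True)

# Reading back: index=False loses the labels, index=True can restore them.
back_without = pd.read_csv(io.StringIO(without_index))
back_with = pd.read_csv(io.StringIO(with_index), index_col=0)

print(without_index)
print(with_index)
```

With `index=True` the file gains an unnamed first column holding the index, which has to be declared with `index_col=0` when reading it back.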

### Data visualization




In [None]:
import seaborn as sns

In [None]:
penguins = sns.load_dataset("penguins")

In [None]:
type(penguins)

In [None]:
penguins.shape

In [None]:
penguins.head()

#### Bar chart

Useful for
 - categorical variables
 - discrete numeric variables

In [None]:
penguins.species.value_counts()

In [None]:
sns.countplot(x=penguins.species, order=penguins.species.value_counts().index)

In [None]:
sns.countplot(x=penguins.species, palette="Reds")

In [None]:
# equivalent call style: pass the column name together with data=
sns.countplot(x="island", data=penguins, palette=["lightblue", (212 / 255, 13 / 255, 68 / 255), "#65FA8C"])

#### Histograms

Useful for a single continuous numeric variable

In [None]:
penguins.head()

In [None]:
sns.histplot(x=penguins.body_mass_g)

In [None]:
sns.histplot(x=penguins.body_mass_g, hue=penguins.species, element="step")

#### Scatterplot

Useful for two continuous numeric variables

In [None]:
penguins.head()

In [None]:
penguins.corr(numeric_only=True)  # skip text columns such as species; required in pandas 2.0+
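Since pandas 2.0, `corr()` no longer drops text columns silently, so a frame that mixes text and numbers needs `numeric_only=True`. A minimal sketch on a hypothetical four-row frame, with values invented so that the two numeric columns are perfectly linear:

```python
import pandas as pd

# Hypothetical penguins; body_mass_g = 100 * bill_length_mm - 300 by construction.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "bill_length_mm": [38.0, 40.0, 46.0, 50.0],
    "body_mass_g": [3500.0, 3700.0, 4300.0, 4700.0],
})

# Only numeric columns enter the Pearson correlation matrix.
corr = toy.corr(numeric_only=True)

print(corr)
```

A correlation close to 1 or -1 only indicates a linear relationship; the scatterplots in this section are still needed to check the actual shape.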

<img width=600 src="https://allisonhorst.github.io/palmerpenguins/reference/figures/culmen_depth.png">

In [None]:
sns.scatterplot(x="bill_length_mm", y="bill_depth_mm", data=penguins)

In [None]:
sns.scatterplot(x="bill_length_mm", y="bill_depth_mm", hue="species", data=penguins)

#### Pairplot

Useful for several continuous numeric variables

In [None]:
penguins.corr(numeric_only=True)

In [None]:
sns.pairplot(penguins)

In [None]:
sns.pairplot(penguins, hue="species")

## Future

### Preprocessing

Feature engineering, ML techniques

### Graphical analysis

### Applying functions

We use the `apply` method

In [None]:
cars.head()

Suppose we want to compute, for each car:
 * the first word of its name
 * whether the model year was a leap year

First we create the function, then we apply it

In [None]:
def get_first_word(sentence):
    words = sentence.split()

    return words[0]

In [None]:
get_first_word("hola me llamo manuel")

In [None]:
def is_bisiest(year):
    # Gregorian rule: divisible by 4, except century years not divisible by 400
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

In [None]:
is_bisiest(2001)

In [None]:
cars["is_bisiest"] = cars["model year"].apply(is_bisiest)

In [None]:
cars["name1"] = cars["car name"].apply(get_first_word)

In [None]:
cars.head()
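For comparison, both columns can also be computed without writing custom functions. This is a sketch under two assumptions: the frame is hypothetical, and the years are four-digit (if the dataset stores two-digit years such as 70, they would need converting first). It uses the `.str` accessor for the first word and the standard library's `calendar.isleap` for the leap-year rule.

```python
import calendar

import pandas as pd

# Hypothetical cars with four-digit model years (an assumption, see above).
toy = pd.DataFrame({
    "car name": ["ford torino", "amc hornet", "honda civic"],
    "model year": [1972, 1973, 1900],
})

# Vectorized string handling: split on whitespace, then take the first piece.
toy["name1"] = toy["car name"].str.split().str[0]

# apply calls the function once per value; calendar.isleap implements the full rule.
toy["is_leap"] = toy["model year"].apply(calendar.isleap)

print(toy)
```

The year 1900 is divisible by 4 but is not a leap year, which is the case a plain `year % 4 == 0` check gets wrong.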

### Recap: most commonly used pandas methods

```python
df.head()                     # prints the first rows, 5 by default
df.tail()                     # prints the last rows, 5 by default
df.describe()                 # summary statistics
df.info()                     # information about the df
df.info(memory_usage='deep')  # same, with exact memory usage
df.columns                    # shows the column labels
df.index                      # shows the index
df.dtypes                     # shows the data type of each column
df.plot()                     # draws a plot
df.hist()                     # draws a histogram per numeric column
df.col.value_counts()         # counts how often each unique value appears in a column
df.col.unique()               # shows the unique values of a column
df.copy()                     # copies the df
df.drop()                     # drops rows or columns (axis=0 or 1)
df.dropna()                   # drops rows with missing values
df.fillna()                   # fills missing values
df.shape                      # dimensions of the df
df.select_dtypes('number')    # selects the numeric columns
df.rename()                   # renames columns or index labels
df.col.str.replace()          # replaces text in a string column
df.astype(dtype='float32')    # changes the data type
df.iloc[]                     # selects by integer position
df.loc[]                      # selects by label
df.transpose()                # transposes the df
df.T
df.sample(n, frac)            # random sample of the df (pass n or frac, not both)
df.col.sum()                  # sum of a column
df.col.max()                  # maximum of a column
df.col.min()                  # minimum of a column
df[col]                       # selects a column
df.col
df.isnull()                   # flags missing values
df.isna()
df.notna()                    # flags non-missing values
df.drop_duplicates()          # drops duplicate rows
df.reset_index(inplace=True)  # resets the index in place
```

### Extra material!

* [Cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
* [Practice exercises](https://github.com/guipsamora/pandas_exercises)
* [merge, concat, join](https://realpython.com/pandas-merge-join-and-concat/#pandas-join-combining-data-on-a-column-or-index)