# Topic: Descriptive Analysis and Inspection
<br/><br/>

<center>
    
## Data Science Workshop
### Omar Piña Ramírez
### Instituto Nacional de Perinatología
### Departamento de Bioinformática y Análisis Estadísticos
### Medical Sciences Researcher
### delozath@gmail.com
</center>

In [None]:
%%html
<style>
.output_wrapper, .output {
    height:auto !important;
    max-height:1500px;  /* your desired max-height here */
}
.output_scroll {
    box-shadow:none !important;
    -webkit-box-shadow:none !important;
}
.CodeMirror{
    font-size: 20px;
}

.rendered_html table, .rendered_html td, .rendered_html th {
    font-size: 120%;
}
</style>

In [None]:
import numpy   as np
import pandas  as pd
import seaborn as sns

from   matplotlib import pyplot as plt

import ipywidgets as widgets
from   ipywidgets import interact, interact_manual, FloatSlider, Layout

import chart_studio.plotly as py
import plotly.graph_objs   as go
import plotly.express      as px
from   plotly.offline      import iplot, init_notebook_mode

import cufflinks
cufflinks.go_offline(connected=True)
init_notebook_mode (connected=True)

## Database

[Kaggle: Cardiovascular Disease heart attack by statistical learning](https://www.kaggle.com/yassinehamdaoui1/cardiovascular-disease/code)

| Feature  | Description                         | Type        |
|---------:|:------------------------------------|:------------|
|sbp       | Systolic blood pressure             | Numeric     |
|tobacco   | Cumulative tobacco                  | Numeric     |
|ldl       | Low density lipoprotein cholesterol | Numeric     |
|adiposity |                                     | Numeric     |
|famhist   | Family history of heart disease     | Categorical |
|typea     | Type-A behavior                     | Numeric     |
|obesity   |                                     | Numeric     |
|alcohol   | Current alcohol consumption         | Numeric     |
|age       | Age at onset                        | Numeric     |
|chd       | Response                            | Target      |


[Rousseauw, J., du Plessis, J., Benade, A., Jordaan, P., Kotze, J. and Ferreira, J. (1983). Coronary risk factor screening in three rural communities, South African Medical Journal 64: 430-436.](https://journals.co.za/doi/pdf/10.10520/AJA20785135_9894)

## Basic interaction with datasets
The pandas library

```python
import pandas as pd
```

In [None]:
# open the data file
PATH = './data/'
file = 'cardiovascular.csv'

data = pd.read_csv(PATH + file)

In [None]:
data.head()

In [None]:
# columns and their dtypes
for n,(i,j) in enumerate(zip(data.columns,data.dtypes)):
    print(  "Column {:2d}: {:9s} -> {:s}".format(n,i,str(j))  )

In [None]:
# summary statistics
data.describe()
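
By default, `describe()` summarizes only the numeric columns, so a categorical feature such as `famhist` is left out of the table. A minimal sketch on a hypothetical toy frame (the values are invented, and the labels `Present`/`Absent` are an assumption about how `famhist` is encoded) shows one way to summarize the non-numeric columns as well:

```python
import pandas as pd

# Hypothetical toy data; not taken from cardiovascular.csv.
toy = pd.DataFrame({
    "sbp": [120, 140, 160, 130],
    "famhist": ["Present", "Absent", "Present", "Present"],
})

# Default behavior: only numeric columns are summarized.
numeric_summary = toy.describe()

# Excluding numeric dtypes summarizes the remaining (categorical) columns
# with count, unique, top and freq.
categorical_summary = toy.describe(exclude="number")

print(numeric_summary.columns.tolist())
print(categorical_summary.loc[["unique", "top", "freq"], "famhist"])
```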

## Searches and filters (query)

In [None]:
query = data['sbp'] > 180
data.loc[query]

In [None]:
query = data['sbp'].between(105,200)
data.loc[~query]

In [None]:
data.query("sbp<=105 | sbp>=200")

In [None]:
data.query("sbp<=105 | adiposity<10")
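
The negated `between(105,200)` filter and the `query("sbp<=105 | sbp>=200")` filter look equivalent, but they differ at the boundaries: `between` is inclusive on both ends by default, so its negation drops rows where `sbp` is exactly 105 or 200. The sketch below uses a hypothetical toy frame (invented values that include both boundaries) to show the difference and one way to make the two filters agree. The string form of `inclusive` assumes pandas 1.3 or later.

```python
import pandas as pd

# Hypothetical toy values, chosen to include both boundaries (105 and 200).
toy = pd.DataFrame({"sbp": [100, 105, 150, 200, 210]})

# between() includes both ends by default, so its negation keeps only
# values strictly outside the interval [105, 200].
outside_strict = toy.loc[~toy["sbp"].between(105, 200)]

# The query string uses <= and >=, so it also keeps the boundary values.
outside_inclusive = toy.query("sbp <= 105 | sbp >= 200")

# Excluding the boundaries from between() makes both filters agree.
mask = ~toy["sbp"].between(105, 200, inclusive="neither")
matched = toy.loc[mask]

print(outside_strict["sbp"].tolist())
print(outside_inclusive["sbp"].tolist())
print(matched.equals(outside_inclusive))
```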

## Dynamic queries
Built with **decorators** from the ipywidgets library

```python
import ipywidgets as widgets
from   ipywidgets import interact, interact_manual, FloatSlider, Layout

@interact
def funcion():
    return
```
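
For readers unfamiliar with decorators: writing `@interact` above a function is shorthand for `funcion = interact(funcion)`. The plain-Python sketch below uses a hypothetical decorator, `call_with_defaults`, that only loosely mimics this behavior. It calls the function once with its default arguments and hands the function back unchanged. It does not build any widgets, and `interact` itself derives the initial values from the widgets rather than from the raw defaults.

```python
import inspect

def call_with_defaults(func):
    """Hypothetical decorator: call func once with its default arguments."""
    defaults = {
        name: param.default
        for name, param in inspect.signature(func).parameters.items()
        if param.default is not inspect.Parameter.empty
    }
    func(**defaults)  # first call happens at decoration time
    return func       # the name stays bound to the original function

calls = []

@call_with_defaults
def show(column="sbp", x=180):
    # Record each call so the effect of the decorator is visible.
    calls.append((column, x))
    return column, x

print(calls)
```

After decoration, `show` is still an ordinary function and can be called directly with other arguments.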

In [None]:
@interact
def show(column=['sbp','obesity'], x=(0,250,1)):
    return data.loc[data[column] < x]

In [None]:
numeric_cols = data.select_dtypes(np.number).columns  # column names for the dropdown

@interact
def show(column=numeric_cols, x=(0,250,1)):
    return data.loc[data[column] > x]

In [None]:
category_cols = data.select_dtypes(exclude=np.number)
category_cols
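
`select_dtypes` returns a sub-frame, not a list of names, so `.columns` is needed when only the column labels are wanted (for example, to fill a dropdown). A minimal sketch on a hypothetical toy frame with invented values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from cardiovascular.csv.
toy = pd.DataFrame({
    "sbp": [120, 140],
    "ldl": [4.1, 5.7],
    "famhist": ["Present", "Absent"],
})

# Split the column names by dtype: numeric versus everything else.
numeric_names = toy.select_dtypes(include=np.number).columns.tolist()
category_names = toy.select_dtypes(exclude=np.number).columns.tolist()

print(numeric_names)
print(category_names)
```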

## Static plots

```python
import seaborn as sns
from   matplotlib import pyplot as plt
```

In [None]:
fig, ax  = plt.subplots(1,1,figsize=(8,8))
sns.scatterplot(x='adiposity',y='obesity',data=data, ax=ax, s=100)
ax.grid(True, which='major', color='black', linewidth=0.075)  # `b=` was removed in recent Matplotlib
plt.show()

## Dynamic plots

```python
import ipywidgets as widgets
from   ipywidgets import interact, interact_manual, FloatSlider, Layout

import chart_studio.plotly as py
import plotly.graph_objs   as go
import plotly.express      as px
from   plotly.offline      import iplot, init_notebook_mode

import cufflinks
cufflinks.go_offline(connected=True)
init_notebook_mode  (connected=True)

@interact
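def scatter_plot(x=['sbp', 'ldl'], y=['adiposity', 'obesity']):
    # Template: each list becomes a dropdown; plot the selected columns.
    px.scatter(data, x=x, y=y).show()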
```

In [None]:
numeric_cols = data.select_dtypes(include=np.number).columns
hue_cols     = ['chd','famhist']

@interact
def scatter_plot(x=numeric_cols[1:], 
                 y=numeric_cols[2:],hue=hue_cols):
    
    fig = px.scatter(data, x=x, y=y, color=hue,
                     title=f"{x.title()} vs {y.title()}")
    fig.update_traces(marker={'size': 15})
    fig.show()

In [None]:
colors = px.colors.qualitative.T10
@interact
def scatter_matrix(hue=hue_cols):
    fig = px.scatter_matrix(data,
                            color=hue,
                            color_discrete_sequence=colors)
    fig.update_traces(marker={'size': 5})
    fig.show()

In [None]:
numeric_cols = data.select_dtypes('number').columns

@interact
def hist_plot(x=numeric_cols):
    data[x].iplot(kind='hist', x=x,
                  xTitle=x.title(),
                  title='Histogram')

## Data for exercises

UCI Machine Learning Repository: [Breast Cancer Wisconsin (Original) Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29)

[Description file](./data/breast-cancer-wisconsin.names)

[Original data](./data/breast-cancer-wisconsin.data)

[Cleaned data with equal prevalence](./data/breast-cancer-wisconsin_scrubbed_eq-prev.csv)