### How to explore data with pandas?

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("data.csv")

In [None]:
display(df.head(10))

In [None]:
# Loan status of non-graduate applicants with an income of at least 1000
df.loc[(df['ApplicantIncome'] >= 1000) & (df['Education'] == 'Not Graduate'), 'Loan_Status']
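
The `.loc` filter above combines two boolean masks with `&`. A minimal sketch on a small made-up frame (the values are hypothetical, not taken from `data.csv`; only the column names mirror the ones used above) shows how each mask is built and then combined:

```python
import pandas as pd

# Hypothetical toy data; only the column names match the notebook's dataset.
toy = pd.DataFrame({
    "ApplicantIncome": [500, 1500, 3000, 800],
    "Education": ["Not Graduate", "Not Graduate", "Graduate", "Graduate"],
    "Loan_Status": ["N", "Y", "Y", "N"],
})

# Each comparison yields a boolean Series aligned with the rows.
income_mask = toy["ApplicantIncome"] >= 1000
education_mask = toy["Education"] == "Not Graduate"

# `&` is the element-wise AND. When the comparisons are written inline,
# each one needs parentheses because `&` binds tighter than `>=` and `==`.
selected = toy.loc[income_mask & education_mask, "Loan_Status"]
print(selected)
```

Naming the masks is optional, but it tends to make longer filters easier to read and to reuse.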

In [None]:
df.describe()

In [None]:
# Frequency of each distinct value, one column at a time
for column in df:
    print(df[column].value_counts())
    print("\n\n")

### How do I look for problems or errors in the data?

In [None]:
df.describe()

In [None]:
def conteo_nulos(x):
    # Number of missing values in the column (Series) x
    return x.isnull().sum()

In [None]:
df.apply(conteo_nulos, axis = 0)

In [None]:
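# Fill missing loan amounts with the column mean.
# Assigning the result back avoids the chained in-place fillna,
# which recent pandas versions warn about and may not apply to df.
m = df['LoanAmount'].mean()
df['LoanAmount'] = df['LoanAmount'].fillna(m)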

In [None]:
df.apply(conteo_nulos, axis = 0)
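
The count, fill, and recount steps above can be sketched end to end on a small made-up frame. The values below are hypothetical; `isnull().sum()` on the whole frame is used here as a shorthand that should give the same per-column counts as applying `conteo_nulos`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values.
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 200.0, np.nan],
    "Gender": ["Male", None, "Female", "Male"],
})

# Missing values per column, before any fix.
nulls_before = toy.isnull().sum()

# mean() skips NaN by default, so the mean here is (100 + 200) / 2 = 150.
toy["LoanAmount"] = toy["LoanAmount"].fillna(toy["LoanAmount"].mean())

# Only the imputed column changes; "Gender" still has its missing value.
nulls_after = toy.isnull().sum()
print(nulls_before, nulls_after, sep="\n\n")
```

Mean imputation only makes sense for numeric columns; a categorical column such as the hypothetical `Gender` above would need a different strategy.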

### How to combine DataFrames?

**Concatenation**

In [None]:
def make_df(cols, ind):
    # Build a DataFrame whose cells read "<column><index>", e.g. "A0"
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

In [None]:
make_df('ABC', range(3))

In [None]:
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
df3 = pd.concat([df1, df2], axis = 0)
display(df1, df2, df3)
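
In the cell above the two frames have disjoint index labels, so the concatenated result has a clean index. A short sketch of what happens when the labels overlap, and of the `ignore_index` option of `pd.concat`, using made-up frames (`left` and `right` are illustrative names):

```python
import pandas as pd

# Two hypothetical frames that share the same index labels.
left = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]}, index=[1, 2])
right = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]}, index=[1, 2])

# Default behaviour: original labels are kept, so 1 and 2 each appear twice.
stacked = pd.concat([left, right], axis=0)

# ignore_index=True discards the old labels and builds a fresh 0..n-1 index.
renumbered = pd.concat([left, right], axis=0, ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```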

**Merge**

In [None]:
df1 = pd.DataFrame({'employee' : ['Bob', 'Jake', 'Lisa', 'Sue'], 'group' : ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee' : ['Lisa', 'Bob', 'Jake', 'Sue'], 'hire_date' : [2004, 2008, 2012, 2014]})
display(df1, df2)

In [None]:
df3 = pd.merge(df1, df2)
df3
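
`pd.merge(df1, df2)` with no further arguments joins on the columns the two frames have in common, here `employee`. A sketch that spells out the key and the join type explicitly (`implicit` and `explicit` are illustrative names) and should produce the same table:

```python
import pandas as pd

df1 = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa", "Sue"],
                    "group": ["Accounting", "Engineering", "Engineering", "HR"]})
df2 = pd.DataFrame({"employee": ["Lisa", "Bob", "Jake", "Sue"],
                    "hire_date": [2004, 2008, 2012, 2014]})

# Implicit: pandas infers the join key from the shared column names.
implicit = pd.merge(df1, df2)

# Explicit: same join, with the key and the join type written out.
explicit = pd.merge(df1, df2, on="employee", how="inner")
print(explicit)
```

Writing `on=` and `how=` out is a style choice, but it guards against surprises if the frames later gain another column with a shared name.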

### How to perform an aggregate analysis?

In [None]:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.head()

In [None]:
planets.describe()

In [None]:
planets.dropna(inplace = True)
planets.describe()

**Are there differences in the types of planets each method can find?**

In [None]:
planets.groupby('method')[['orbital_period']].mean()

**Are there differences in how the methods are used (how recent, how many, etc.)?**

In [None]:
planets.groupby('method')['year'].describe()
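
Both questions above can also be answered in a single table with named aggregations. The sketch below uses a small made-up frame rather than the real `planets` dataset, so the numbers are purely illustrative, and the output column names are assumptions chosen for readability:

```python
import pandas as pd

# Hypothetical toy data, not the real planets dataset.
toy = pd.DataFrame({
    "method": ["Transit", "Transit", "Radial Velocity", "Radial Velocity"],
    "orbital_period": [2.0, 4.0, 100.0, 300.0],
    "year": [2010, 2012, 1995, 2005],
})

# One row per method; each keyword defines an output column
# as a (source column, aggregation) pair.
summary = toy.groupby("method").agg(
    mean_period=("orbital_period", "mean"),
    first_year=("year", "min"),
    last_year=("year", "max"),
    n_planets=("year", "size"),
)
print(summary)
```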

### More aggregate analysis with pivot tables

In [None]:
titanic = sns.load_dataset('titanic')
titanic.head()

**Are there biases and/or subgroups among the victims?**

In [None]:
titanic.groupby('sex')[['survived']].mean()

In [None]:
titanic.groupby(['sex', 'class'])[['survived']].aggregate('mean')

In [None]:
titanic.pivot_table('survived', index='sex', columns='class')
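
The last two cells compute the same survival rates in two layouts: `groupby` returns one row per (sex, class) pair, while `pivot_table` spreads the classes across columns. A sketch on made-up rows (not the real `titanic` data) suggests that unstacking the grouped result gives the same numbers as the pivot table, since `pivot_table` aggregates with the mean by default:

```python
import pandas as pd

# Hypothetical toy data, not the real titanic dataset.
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male"],
    "class": ["First", "Third", "First", "Third", "Third"],
    "survived": [1, 0, 1, 0, 1],
})

# Long form: mean per (sex, class), then move "class" into the columns.
via_groupby = toy.groupby(["sex", "class"])["survived"].mean().unstack()

# Wide form directly; the default aggregation is the mean.
via_pivot = toy.pivot_table(values="survived", index="sex", columns="class")
print(via_pivot)
```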

In [None]:
titanic['age_range'] = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', ['sex', 'age_range'], 'class')
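
`pd.cut` with the edges `[0, 18, 80]` builds two intervals that are closed on the right by default, so an age of exactly 18 falls in the first bin and an age of exactly 0 falls in neither. A small sketch with made-up ages:

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([0, 5, 18, 19, 80])

# Bins are (0, 18] and (18, 80]; values outside every bin become NaN.
age_range = pd.cut(ages, [0, 18, 80])
print(age_range)
```

If the lower edge should be included, `pd.cut` accepts `include_lowest=True`.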

In [None]:
fare = pd.qcut(titanic['fare'], 2)
titanic.pivot_table('survived', ['sex', 'age_range'], [fare, 'class'])

In [None]:
titanic.pivot_table(index='sex', columns='class', aggfunc={'survived':'sum', 'fare':'mean'})

### Visualizations

**One example among many: how can we identify outliers?**

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.rcParams['text.color'] = 'blue'
plt.rcParams['axes.labelcolor'] = 'blue'
plt.rcParams['figure.figsize'] = [15, 10]
plt.rcParams.update({'font.size': 16})

In [None]:
titanic['fare'].hist(bins=20)

In [None]:
titanic.boxplot(column='fare')
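
The box plot marks points beyond its whiskers as outliers; with matplotlib's default settings the whiskers extend up to 1.5 times the interquartile range (IQR) from the box. A minimal numeric sketch of that rule, assuming the default whisker length and using made-up fares rather than the real `titanic` column:

```python
import pandas as pd

# Hypothetical fares, not taken from the titanic dataset.
fare = pd.Series([7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 500.0])

# Quartiles and interquartile range.
q1 = fare.quantile(0.25)
q3 = fare.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values that fall outside the fences.
outliers = fare[(fare < lower) | (fare > upper)]
print(outliers)
```

The 1.5 factor is a convention, not a law; a different threshold may suit heavily skewed columns better.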