# Data Exploration

### Contributors
Professor Foster Provost - NYU Stern School of Business and Carlos Fernandez - teaching assistant


## Imports Needed

In [None]:
import os
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set(style='ticks', palette='Set2')

## Predicting who will survive the Titanic

This time we will use a classic introductory dataset that contains demographic and travel information for the Titanic passengers. The goal is to predict which passengers survived. We will keep only a few variables of interest (`survived`, `pclass`, `sex`, `age` and `fare`), drop rows with missing values, and transform the remaining variables to numeric ones. We will also drop some fare outliers.

In [None]:
# Load data
path = "./data/titanic.csv"
df = pd.read_csv(path)[["survived", "pclass", "sex", "age", "fare"]].dropna()
df['survive'] = df.survived.astype(bool)
df.head()

In [None]:
# create a copy of df and call it df2 so that we can make some transformations and still keep df
df2 = df.copy()
# Transform sex column to a numeric variable
df2["female"] = (df2.sex == "female").astype(int)
df2 = df2.drop("sex", axis="columns")
# Drop outliers. This is to help the visualization in the next examples.
df2 = df2[df2.fare < 400]
# Take a look at the data
df2.head(5)

In [None]:
df.info()

In [None]:
df2.info()

In [None]:
# just the numeric columns
df.describe()

In [None]:
# all columns including the categorical
df.describe(include='all')

In [None]:
# just the categorical columns
df.describe(include=['O','bool'])

In [None]:
# describe specific column (series)
# what does the mean here tell us?
df2.female.describe()
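
The mean of a 0/1 indicator column such as `female` is the fraction of rows equal to 1, so here it would be the share of passengers who are female. A minimal sketch on a made-up toy frame (the `toy` data below is illustrative only and is not taken from the Titanic file):

```python
import pandas as pd

# Made-up toy data for illustration, not the Titanic dataset
toy = pd.DataFrame({"sex": ["female", "male", "male", "female", "male"]})

# Same transformation as above: 1 for female, 0 otherwise
toy["female"] = (toy.sex == "female").astype(int)

# The mean of a 0/1 column is the proportion of ones
share_female = toy.female.mean()
print(share_female)  # 0.4 (2 of the 5 toy rows are female)
```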

### Exploration Visualizations

We'd like to use information about the passengers to predict whether they will survive. Let's start by taking a look at how well some of the variables "split" the data according to our target.
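
One simple way to check how well a variable "splits" the data is to compare the survival rate within each of its groups against the overall rate. The sketch below does this on a made-up toy frame (the `toy` values are hypothetical, not real passenger records); the same `groupby` pattern should apply to `df`.

```python
import pandas as pd

# Made-up toy data for illustration, not the Titanic dataset
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male", "male", "male"],
    "survived": [1, 1, 0, 0, 0, 1, 0, 0],
})

# Survival rate within each group (mean of a 0/1 target)
rate_by_sex = toy.groupby("sex")["survived"].mean()

# Overall survival rate, for comparison
overall = toy["survived"].mean()

print(rate_by_sex)
print("overall:", overall)
```

The further the group rates sit from the overall rate, the more informative the variable is about the target.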

In [None]:
df.groupby('sex')['fare'].median()

In [None]:
def createBoxPlot(df, x, y):
    sns.set(style="whitegrid")
    # Median of y per group; groupby sorts the groups, so draw the boxes in that same order
    medians = df.groupby(x)[y].median()
    p = sns.boxplot(x=x, y=y, data=df, order=list(medians.index))

    # Write each group's median inside its box
    for tick, m in enumerate(medians.values):
        p.text(tick - .2, m, str(np.round(m, 2)), horizontalalignment='center', color='w', weight='semibold')
    plt.show()

In [None]:
createBoxPlot(df,'sex','fare')

In [None]:
sns.set(style="whitegrid")
sns.boxplot(x="survive", y="fare", hue='sex', width=0.4, data=df)
plt.show()

Above we see boxplots that show the fare distribution grouped by our target variable (survival) and split by sex. The boxes on the left correspond to passengers who died and the boxes on the right to passengers who survived. Alternatively, we could plot a histogram of fare for each survival outcome.

In [None]:
for r in range(2):
    hist = df[df.survived == r].hist('fare')
    plt.title("survived =" + str(r))
    plt.ylim(0,470)
    plt.show()

It seems that passengers who paid lower fares were less likely to survive. We could use this as a simple prediction rule: predict that passengers who paid more than 50 survived and that everyone else did not.
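
A rule like this can be checked by comparing its predictions with the actual outcomes. The sketch below is a minimal illustration on made-up values (the `toy` frame, `threshold`, `accuracy` and `baseline` names are assumptions for the example, and the numbers say nothing about the real data):

```python
import pandas as pd

# Made-up toy data for illustration, not the Titanic dataset
toy = pd.DataFrame({
    "fare": [7.25, 8.05, 13.0, 26.0, 52.0, 71.3, 83.2, 120.0],
    "survived": [0, 0, 1, 0, 0, 1, 1, 1],
})

# Rule: predict survival when the fare is above the threshold
threshold = 50
predicted = (toy.fare > threshold).astype(int)

# Fraction of rows where the rule agrees with the actual outcome
accuracy = (predicted == toy.survived).mean()

# Baseline: always predict the more common class
baseline = max(toy.survived.mean(), 1 - toy.survived.mean())

print("accuracy:", accuracy)
print("baseline:", baseline)
```

Comparing against the majority-class baseline shows whether the rule adds anything beyond always guessing the most common outcome.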

In [None]:
def createCorrelationPlot(df):
    sns.set(style="white")
    # Compute the correlation matrix on the numeric and boolean columns
    # (non-numeric columns such as sex are skipped)
    corr = df.select_dtypes(include=['number', 'bool']).corr()

    # Generate a mask for the upper triangle (np.bool was removed from NumPy; use bool)
    mask = np.triu(np.ones_like(corr, dtype=bool))

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(220, 10, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .5})
    plt.show()

In [None]:
createCorrelationPlot(df)

### Skewness & Kurtosis

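#### Kurtosis: how heavy the tails of the distribution are
* Larger value indicates
    * heavier tails, i.e. more extreme values (outliers)
    * often, though not necessarily, a sharper peak
* The values printed below are excess kurtosis (the `scipy.stats.kurtosis` default), for which a normal distribution scores 0
* Positive value: heavier tails than a normal distribution
* Negative value: lighter tails than a normal distribution (flatter shape)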

#### Skewness: how asymmetrically the data is distributed
* Absolute value greater than 1
    * Obvious extended spread in one direction
* Positive value: longer right tail
* Negative value: longer left tail
* The greater the skewness, the greater the distortion, which is generally worse for data mining
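
Both quantities can be computed from the central moments of the data: skewness as m3 / m2^1.5 and excess kurtosis as m4 / m2^2 - 3, where mk is the mean of the k-th power of the deviations from the mean. The sketch below implements these moment-based definitions with NumPy on two made-up samples; the helper name `moment_skew_kurtosis` is hypothetical. These are the same biased estimators that `scipy.stats.skew` and `scipy.stats.kurtosis` use with their default arguments.

```python
import numpy as np

def moment_skew_kurtosis(values):
    """Return (skewness, excess kurtosis) from central moments."""
    x = np.asarray(values, dtype=float)
    d = x - x.mean()          # deviations from the mean
    m2 = np.mean(d ** 2)      # second central moment (variance)
    m3 = np.mean(d ** 3)      # third central moment
    m4 = np.mean(d ** 4)      # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis

# Made-up samples for illustration
symmetric = [1, 2, 3, 4, 5]        # no tail on either side
right_tailed = [1, 1, 1, 2, 10]    # one large value stretches the right tail

skew_sym, kurt_sym = moment_skew_kurtosis(symmetric)
skew_right, kurt_right = moment_skew_kurtosis(right_tailed)

print(skew_sym, kurt_sym)
print(skew_right, kurt_right)
```

The symmetric sample has zero skewness and negative excess kurtosis, while the sample with one large value has positive skewness.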

In [None]:
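from scipy.stats import kurtosis, skew

# distplot is deprecated in recent seaborn; histplot with a KDE plus rugplot gives a comparable plot
sns.histplot(x=df.fare, kde=True, stat='density')
sns.rugplot(x=df.fare)
plt.show()

# scipy reports excess kurtosis by default (a normal distribution gives 0)
print('kurtosis is: {}'.format(kurtosis(df.fare)))

print('skewness is: {}'.format(skew(df.fare)))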

In [None]:
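# One overlaid fare histogram per sex; common_norm=False normalizes each group separately
sns.histplot(data=df, x='fare', hue='sex', kde=True, stat='density', common_norm=False)
plt.show()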

In [None]:
from scipy.stats import kurtosis, skew

# distplot is deprecated in recent seaborn; histplot with a KDE gives a comparable plot
sns.histplot(x=df.age, kde=True, stat='density')
plt.show()

print('kurtosis is: {}'.format(kurtosis(df.age)))

print('skewness is: {}'.format(skew(df.age)))

In [None]:
from statsmodels.graphics.mosaicplot import mosaic
import matplotlib.pyplot as plt

mosaic(df, ['sex', 'survive'])
plt.xlabel('Sex')
plt.show()