
## EDA Process

We start by thinking about our data and asking questions.

If there is a plan, or even a chance, that you will build models with this data, split the dataset first and explore only train. Exploring only the train data keeps us from peeking at our out-of-sample data. If you are absolutely not going to model, it's OK to explore the entire dataset. Any exploration more in-depth than a histogram or .value_counts() should be done only on train.

Hypothesize: Form and document your initial hypotheses about how the predictors (independent variables, features, or attributes) interact with the target (y-value or dependent variable). You can do this in the form of questions in a natural language (as opposed to "statistical" language).

Visualize: Use visualization techniques to identify drivers. When a visualization needs to be followed up with a statistical test, do so.

Test: Test your hypotheses when a visualization isn't immediately clear. Use the appropriate statistical tests (t-tests, correlation, chi-square).

## General Recipe

Bivariate Stats. Bivariate means two variables.

Plot the interactions of each variable with the target. Document your takeaways.

For numeric to numeric, use a scatterplot or lineplot

For numeric to categorical variables, see https://seaborn.pydata.org/tutorial/categorical.html

Explore interactions among the independent variables using visualization and/or hypothesis testing to address interdependence.

Multivariate Stats (more than 2 variables): Ask additional questions of the data, such as how subgroups compare to each other and to the overall population. Answer questions using visualizations and/or hypothesis testing.
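
As a rough sketch of comparing subgroups to the overall population, the snippet below uses a tiny made-up DataFrame (the values are illustrative, not the Titanic data; only the column names mirror it) and a pandas groupby.

```python
import pandas as pd

# Toy data for illustration only; these are NOT real Titanic rows.
toy = pd.DataFrame({
    "pclass":    [1, 1, 1, 1, 3, 3, 3, 3],
    "is_female": [1, 1, 0, 0, 1, 1, 0, 0],
    "survived":  [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rate of the target across the whole (toy) population
overall_rate = toy["survived"].mean()

# Rate of the target within each subgroup defined by two features
subgroup_rates = toy.groupby(["pclass", "is_female"])["survived"].mean()

# How far each subgroup sits from the overall rate
print(f"overall: {overall_rate:.2f}")
print(subgroup_rates - overall_rate)
```

Subgroups that sit far from the overall rate are good candidates for a follow-up visualization or hypothesis test.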

If you're using seaborn's relplot or catplot, use the hue or col arguments to add extra dimension(s) to the visuals.

Using sns.pairplot with hue may be helpful. With too many columns, however, it can produce visuals that are too noisy to be useful.

See https://seaborn.pydata.org/tutorial/axis_grids.html for more multivariate options.
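
A minimal sketch of adding dimensions with hue and col, assuming randomly generated stand-in data (the column names echo the Titanic columns, but the values are synthetic):

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(123)

# Synthetic stand-in data, for illustration only
toy = pd.DataFrame({
    "age": rng.uniform(1, 80, size=150),
    "fare": rng.uniform(5, 100, size=150),
    "survived": rng.integers(0, 2, size=150),
    "pclass": np.repeat([1, 2, 3], 50),
})

# x and y carry two variables, hue adds a third (color),
# and col adds a fourth (one panel per pclass value)
g = sns.relplot(data=toy, x="age", y="fare", hue="survived", col="pclass")
```

With three distinct pclass values, this draws one row of three panels, each colored by survived.
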

Statistical Tests: If the visualizations are not crystal clear, it's important to conduct hypothesis tests.

With numeric to numeric, test for correlation with Pearson's r for linear relationships and Spearman's rank correlation for non-linear but monotonic relationships.
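
The difference between the two coefficients can be sketched on synthetic data (the variables below are made up for illustration). Spearman's coefficient works on ranks, so it stays at 1 for any strictly increasing relationship, while Pearson's r drops when the relationship is not a straight line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)

# Synthetic data, for illustration only
x = rng.uniform(0, 5, size=200)
y_linear = 2 * x + rng.normal(0, 1, size=200)  # roughly linear in x
y_curved = np.exp(x)                           # strictly increasing, far from linear

# Pearson's r on the roughly linear relationship
r_linear, p_linear = stats.pearsonr(x, y_linear)

# Pearson vs. Spearman on the curved but monotonic relationship
r_curved, p_r_curved = stats.pearsonr(x, y_curved)
rho_curved, p_rho_curved = stats.spearmanr(x, y_curved)

print(f"Pearson, linear data:  r   = {r_linear:.3f}")
print(f"Pearson, curved data:  r   = {r_curved:.3f}")
print(f"Spearman, curved data: rho = {rho_curved:.3f}")
```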

For numeric to categorical, compare the means of two populations, or of a subgroup to the population, using a t-test if your samples are normal(ish)ly distributed. If the groups have different variances (as determined by calling .var() on each column/Series), use the unequal-variance version of the t-test. Use ANOVA to compare means from more than 2 groups, or a Mann-Whitney U test if the data does not match the assumptions of a t-test.
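
The sketch below shows the corresponding scipy.stats calls on synthetic groups. The group names and values are made up for illustration; in practice each group would be a numeric column filtered by a category.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)

# Synthetic groups, for illustration only
group_a = rng.normal(loc=30, scale=5, size=100)
group_b = rng.normal(loc=35, scale=12, size=100)
group_c = rng.normal(loc=40, scale=5, size=100)

# Eyeball the variances before choosing the flavor of t-test
print(group_a.var(ddof=1), group_b.var(ddof=1))

# Two groups with unequal variances: independent t-test with equal_var=False
t_stat, p_t = stats.ttest_ind(group_a, group_b, equal_var=False)

# A subgroup against the overall population mean: one-sample t-test
population_mean = np.concatenate([group_a, group_b, group_c]).mean()
t_one, p_one = stats.ttest_1samp(group_a, population_mean)

# More than two groups: one-way ANOVA
f_stat, p_f = stats.f_oneway(group_a, group_b, group_c)

# t-test assumptions not met: Mann-Whitney U test
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(p_t, p_one, p_f, p_u)
```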

With categorical to categorical variables, use a $\chi^2$ (chi-squared) test.
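
A minimal sketch of a chi-squared test of independence, assuming a made-up two-column DataFrame (the counts are illustrative, not the Titanic data). The observed counts come from pd.crosstab and are passed to scipy.stats.chi2_contingency.

```python
import pandas as pd
from scipy import stats

# Toy data, for illustration only: 50 rows per value of is_female
toy = pd.DataFrame({
    "is_female": [1] * 50 + [0] * 50,
    "survived":  [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40,
})

# Contingency table of observed counts
observed = pd.crosstab(toy["is_female"], toy["survived"])

# For a 2x2 table, scipy applies Yates' continuity correction by default
chi2, p, dof, expected = stats.chi2_contingency(observed)

print(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.4g}, dof = {dof}")
```

The expected array holds the counts we would see if the two variables were independent, which is useful to print next to the observed table.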


## Standing Orders for Exploration

Document your initial questions or assumptions. Write them down (in your README or notebook) so they are concrete and not in your head.

Document your takeaways after each visualization, even if your takeaway is "there is nothing interesting between var1 and target."

Document your answer to each question.

When you run statistical tests to answer your questions, document your null and alternative hypotheses, the test you run, the test results, and your conclusion.

Document your takeaways, in case that wasn't clear. It is a huge component of your final deliverable/analysis.

Document your action plan. What are your next steps and/or new questions based on what you have learned? I recommend documenting, continuing through all of your questions, and then going back and taking action only after you have answered your initial questions.
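
One possible way to follow the standing order on statistical tests is to write the hypotheses, the test, the results, and the conclusion right next to the code. The sketch below assumes synthetic fare values and an alpha of 0.05. Both are illustrative choices, not part of the lesson's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)

# Synthetic stand-ins for a numeric feature split by the target
fare_survived = rng.normal(loc=50, scale=20, size=120)
fare_died = rng.normal(loc=30, scale=20, size=180)

alpha = 0.05  # assumed significance level
h0 = "Mean fare is the same for passengers who survived and those who did not."
ha = "Mean fare differs between passengers who survived and those who did not."

# The test we run: independent two-sample t-test, unequal variances
t_stat, p_value = stats.ttest_ind(fare_survived, fare_died, equal_var=False)

# The conclusion, stated in terms of the null hypothesis
conclusion = "reject H0" if p_value < alpha else "fail to reject H0"

print(f"H0: {h0}")
print(f"Ha: {ha}")
print(f"t = {t_stat:.3f}, p = {p_value:.4g}, alpha = {alpha}")
print(f"Conclusion: we {conclusion}.")
```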


In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split

from acquire import get_titanic_data

np.random.seed(123)

Found your sauce my bro


In [2]:
df = get_titanic_data()
df.head(2)


Found your sauce my bro


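| | passenger_id | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | NaN | Southampton | 0 |
| 1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | C | Cherbourg | 0 |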


In [3]:
df.isna().sum()

passenger_id      0
survived          0
pclass            0
sex               0
age             177
sibsp             0
parch             0
fare              0
embarked          2
class             0
deck            688
embark_town       2
alone             0
dtype: int64

In [4]:
# Useful helper for checking for nulls
# What proportion of each column is empty?
df.isna().mean()

passenger_id    0.000000
survived        0.000000
pclass          0.000000
sex             0.000000
age             0.198653
sibsp           0.000000
parch           0.000000
fare            0.000000
embarked        0.002245
class           0.000000
deck            0.772166
embark_town     0.002245
alone           0.000000
dtype: float64

In [5]:
# drop rows where age or embarked is null, drop column 'deck', drop passenger_id
def prep_titanic(df):
    '''
    take in titanic dataframe, remove all rows where age or embarked is null, 
    get dummy variables for sex and embark_town, 
    and drop sex, deck, passenger_id, class, and embark_town. 
    '''

    df = df[(df.age.notna()) & (df.embarked.notna())]
    df = df.drop(columns=['deck', 'passenger_id', 'class'])

    dummy_df = pd.get_dummies(df[['sex', 'embark_town']], prefix=['sex', 'embark'])

    df = pd.concat([df, dummy_df.drop(columns=['sex_male'])], axis=1)

    df = df.drop(columns=['sex', 'embark_town']) 

    df = df.rename(columns={"sex_female": "is_female"})

    return df

In [6]:
df = prep_titanic(df)
df.head(2)

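| | survived | pclass | age | sibsp | parch | fare | embarked | alone | is_female | embark_Cherbourg | embark_Queenstown | embark_Southampton |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 | 7.25 | S | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | C | 0 | 1 | 1 | 0 | 0 |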


In [7]:
def train_validate_test_split(df, target, seed=123):
    '''
    This function takes in a dataframe, the name of the target variable
    (for stratification purposes), and an integer for setting a seed
    and splits the data into train, validate and test. 
    Test is 20% of the original dataset, validate is .30*.80= 24% of the 
    original dataset, and train is .70*.80= 56% of the original dataset. 
    The function returns, in this order, train, validate and test dataframes. 
    '''
    train_validate, test = train_test_split(df, test_size=0.2, 
                                            random_state=seed, 
                                            stratify=df[target])
    train, validate = train_test_split(train_validate, test_size=0.3, 
                                       random_state=seed,
                                       stratify=train_validate[target])
    return train, validate, test

In [8]:
# Stratify with categorical target variables
train, validate, test = train_validate_test_split(df, target='survived')
train.shape, validate.shape, test.shape

((398, 12), (171, 12), (143, 12))

In [9]:
# Stratification means we'll get even proportions of the target variable in each data set
train.survived.mean(), validate.survived.mean(), test.survived.mean()

(0.4045226130653266, 0.40350877192982454, 0.40559440559440557)

In [10]:
train.head(2)

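| | survived | pclass | age | sibsp | parch | fare | embarked | alone | is_female | embark_Cherbourg | embark_Queenstown | embark_Southampton |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 450 | 0 | 2 | 36.0 | 1 | 2 | 27.75 | S | 0 | 0 | 0 | 0 | 1 |
| 543 | 1 | 2 | 32.0 | 1 | 0 | 26.0 | S | 0 | 0 | 0 | 0 | 1 |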
