# Classification: Data Exploration

> Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations.
>
> [Prasad Patil, Novice Data Science Storyteller, Mar 23](https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15)

The goals of exploration are to understand the signals in the data and their strength, to identify the features that drive the outcome, and to find other features worth constructing. We get there through questions and hypotheses, and we want to walk away with modeling strategies (e.g., feature selection, algorithm selection, evaluation methods) and actionable insight.

In general, we'll be exploring our target variable against the independent, or predictor, variables.

In [None]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from acquire import get_titanic_data
from prepare import train_validate_test_split

df = get_titanic_data()
df = df.drop(columns='deck')
df = df[~ df.age.isna()]
df = df[~ df.embarked.isna()]

train, validate, test = train_validate_test_split(df)

In [None]:
# validate and test stay out-of-sample: we don't look at them during exploration

train.shape, validate.shape, test.shape
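`train_validate_test_split` comes from a local `prepare` module that is not shown here. As a rough idea of what such a helper might do, the sketch below shuffles a DataFrame and slices it into three parts. The function name `simple_split`, the 60/20/20 proportions, and the seed are assumptions for illustration; the actual helper may stratify on the target or use different proportions.

```python
import pandas as pd


def simple_split(df, validate_frac=0.2, test_frac=0.2, seed=123):
    """Hypothetical helper: shuffle rows, then slice into train/validate/test."""
    shuffled = df.sample(frac=1, random_state=seed)
    n_test = int(len(df) * test_frac)
    n_validate = int(len(df) * validate_frac)
    test_part = shuffled.iloc[:n_test]
    validate_part = shuffled.iloc[n_test:n_test + n_validate]
    train_part = shuffled.iloc[n_test + n_validate:]
    return train_part, validate_part, test_part


# Made-up ten-row frame, not the Titanic data
toy = pd.DataFrame({"x": range(10), "survived": [0, 1] * 5})
train_part, validate_part, test_part = simple_split(toy)
print(train_part.shape, validate_part.shape, test_part.shape)
```

Every row ends up in exactly one of the three parts, so nothing in the validate or test parts is seen during exploration.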

In [None]:
# in-sample means the data we look at
train.head()

## Explore the Target
- What is the thing we're trying to predict?

In [None]:
train.survived.value_counts()

In [None]:
train.survived.value_counts().plot.bar()
plt.xlabel('Survived')

In [None]:
train.survived.hist()

In [None]:
train.survived.mean()

In [None]:
train.survived.value_counts().sort_index().plot.bar()
survival_rate = train.survived.mean()
plt.title(f"Overall survival rate: {survival_rate:.2%}")
plt.xlabel('Survived')

In [None]:
train.fare.hist()

In [None]:
train.fare.value_counts()

In [None]:
train.age.hist()

`survived` can be treated as either a categorical variable or a number. We treat it as categorical when we look at the value counts above, and we can do the same when we split other features by the `survived` category.

Treating `survived` as a number lets us take the average, which we can interpret as the overall survival rate.

## Barplots

Here we'll treat `survived` as a number and explore its interactions with other categorical features. For each other categorical feature, we will calculate the survival rate among its subgroups and visualize them with a barplot.
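Because `survived` is coded as 0 and 1, its mean within a subgroup is that subgroup's survival rate. A minimal sketch on a made-up five-row frame (the `toy` data is invented for illustration and is not the Titanic data):

```python
import pandas as pd

# Made-up data: two women who survived, three men of whom one survived
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male"],
    "survived": [1, 1, 0, 1, 0],
})

overall_rate = toy.survived.mean()                # share of 1s across all rows
rate_by_sex = toy.groupby("sex").survived.mean()  # share of 1s within each subgroup

print(overall_rate)
print(rate_by_sex)
```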

In [None]:
features = ['sex', 'class', 'alone']

In [None]:
enumerate(features)

In [None]:
list(enumerate(features))

In [None]:
survival_rate = train.survived.mean()
_, ax = plt.subplots(nrows=1, ncols=3, figsize=(16, 6), sharey=True)
for i, feature in enumerate(features):
    sns.barplot(x=feature, y='survived', data=train, ax=ax[i], alpha=0.5)
    ax[i].set_xlabel('')
    ax[i].set_ylabel('Survival Rate')
    ax[i].set_title(feature)
    ax[i].axhline(survival_rate, ls='--', color='grey')

Here we add a horizontal dashed line at the overall survival rate so we can quickly compare each subgroup's survival rate against the overall rate, in addition to comparing subgroups with one another.

The black lines on the top of each bar give us the 95% confidence interval for our estimate of the average for each subgroup.
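By default seaborn estimates that interval by bootstrapping: it resamples the subgroup with replacement many times and takes percentiles of the resampled means. The sketch below reproduces the idea with NumPy on made-up 0/1 outcomes. The data, the seed, and the 1,000 resamples are assumptions for illustration, and the numbers will not match seaborn's output exactly.

```python
import numpy as np

rng = np.random.default_rng(42)
outcomes = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0])  # made-up 0/1 outcomes

# Resample with replacement and record the mean of each resample
boot_means = np.array([
    rng.choice(outcomes, size=outcomes.size, replace=True).mean()
    for _ in range(1000)
])

# The middle 95% of the resampled means is the bootstrap confidence interval
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={outcomes.mean():.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

Smaller subgroups give wider intervals, which is why the error bars differ in length from bar to bar.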

In [None]:
# histplot draws a histogram; hue splits the counts by survival
sns.histplot(x="class", data=train, hue="survived")

In [None]:
# countplot
sns.countplot(x="class", data=train, hue="survived")

### Continuous vs. Continuous
- Use a .scatterplot
- Try a .regplot

In [None]:
# hue accepts a categorical or a continuous variable (categorical is recommended)
sns.scatterplot(x="age", y="fare", hue="survived", data=train)

In [None]:
# col takes a discrete feature from our dataset and draws one panel per value
sns.relplot(x="age", y="fare", col="survived", data=train)

### Swarmplot: Discrete x Continuous

A swarmplot can be used to plot a numeric variable with a discrete or categorical variable. Here we are looking at the relationship between class and age and adding the additional dimension of whether or not the passenger survived.

In [None]:
sns.swarmplot(x="pclass", y="age", data=train, hue="survived", palette="Set2")
plt.legend()

In [None]:
sns.swarmplot(x="sex", y="fare", data=train, hue="survived", palette="Set2")
plt.legend()

In [None]:
sns.swarmplot(x="fare", y="sex", data=train, hue="survived", palette="Set2")
plt.legend()

In [None]:
# Using a Catplot
sns.catplot(x="pclass", y="age", data=train, hue="survived")

In [None]:
# Catplot with kind="count"
sns.catplot(x="survived", col="pclass", data=train, kind="count")

In [None]:
# Catplot with kind="violin"
sns.catplot(x="sex", y="fare", data=train, kind="violin")

In [None]:
# seaborn easter egg: displays a random dog photo (needs network access)
sns.dogplot()

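### Violinplot: Discrete x Continuous

A violinplot shows the distribution of a continuous variable within each category. With `split=True`, the two halves of each violin show the two `hue` groups.
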
In [None]:
features = ["class", "embarked", "alone"]
_, ax = plt.subplots(nrows=1, ncols=3, figsize=(12, 6))

for i, feature in enumerate(features):
    sns.violinplot(
        x=feature,
        y="age",
        hue="survived",
        data=train,
        split=True,
        ax=ax[i],
        palette=["blue", "orange"],
    )

In [None]:
train.head()

In [None]:
train.fare.value_counts()

### Crosstab: Discrete x Discrete

Matrix of counts or probabilities

In [None]:
pd.crosstab(train.pclass, train.survived, margins=True)

In [None]:
pd.crosstab(train.pclass, train.survived, margins=True, normalize=True)

In [None]:
crosstab = pd.crosstab(train.pclass, train.survived, margins=True, normalize=True)
sns.heatmap(crosstab)

There appears to be a difference in the survival rate between 1st class and 3rd class passengers.
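To compare survival rates between classes directly, note that `normalize=True` above divides every cell by the grand total, so the cells are shares of all passengers. One option is `normalize="index"`, which makes each row sum to 1 so that each row reads as the survival split within that class. A small sketch on invented data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Made-up data: 3 of 4 first-class passengers survive, 1 of 4 third-class
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# Each row is normalized by its own total, giving within-class rates
rates = pd.crosstab(toy.pclass, toy.survived, normalize="index")
print(rates)
```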

## Continuous x Continuous

In [None]:
sns.relplot(x="fare", y="age", hue="survived", data=train, height=6, aspect=1.6)
plt.xlim(0, 175)

In [None]:
sns.lmplot(x="fare", y="age", hue="survived", data=train)

### Melting Multiple Continuous Variables

Melting lets us compare multiple continuous variables that have the same or similarly scaled units on the same visualization.
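As a quick illustration of what melting does, the sketch below reshapes a made-up two-row frame from wide to long form: each value in the `age` and `fare` columns becomes its own row, labeled by the column it came from. The `wide` frame is invented for illustration.

```python
import pandas as pd

wide = pd.DataFrame({
    "survived": [0, 1],
    "age": [22.0, 38.0],
    "fare": [7.25, 71.28],
})

# id_vars stay as columns; every other column is stacked into measurement/value pairs
long = wide.melt(id_vars="survived", var_name="measurement")
print(long)
```

The long frame has one `measurement` column to put on the x-axis and one `value` column to put on the y-axis, which is the shape the plotting call needs.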

In [None]:
sns.set(style="whitegrid", palette="muted")

# Melt the dataset to "long-form" representation
melt = train[['survived', 'age', 'fare']].melt(id_vars="survived", var_name="measurement")
melt

In [None]:
plt.figure(figsize=(8,6))
p = sns.swarmplot(
    x="measurement",
    y="value",
    hue="survived",
    data=melt,
)

# setting to logscale 
p.set(yscale="log", xlabel='')
plt.show()

In [None]:
# Load the tips dataset from pydataset for a second melting example
from pydataset import data
tips = data("tips")
tips.head()

In [None]:
melt = tips[['sex', 'tip', 'total_bill']].melt(id_vars="sex", var_name="measurement")


plt.figure(figsize=(8,6))
p = sns.swarmplot(
    x="measurement",
    y="value",
    hue="sex",
    data=melt,
)

# tip and total_bill are both in dollars, so we keep the default linear scale
plt.show()

## Statistical Testing

As an example of statistical testing, we'll take a look at the relationship between survival and age. For all of our statistical testing:

1. Experiments should be **reproducible**: If someone runs through the experiment with the same data, they should get the same results. If someone runs through the experiment with another sample of the data, they should arrive at the same conclusion.
2. Experiments should be **documented** succinctly, focusing on the essential aspects of the tasks involved.

Experiment: compare two groups. Is the mean age of survivors significantly different from that of non-survivors?

- $H_{0}$: there is no difference in mean age between subset A (survivors) and subset B (non-survivors).
- $H_{a}$: there is a difference in mean age between subset A (survivors) and subset B (non-survivors).
- Test whether the survivors (subset A) have significantly different ages than the non-survivors (subset B).
- If there is a difference, then `age` may be a good feature to keep.
- We can use a t-test here, as `age` is roughly normally distributed.

In [None]:
from scipy import stats

stats.ttest_ind(
    train[train.survived == 1].age.dropna(),
    train[train.survived == 0].age.dropna(),
)

Based on our p-value, we would fail to reject the null hypothesis that there is no difference in age between passengers who survived and those who didn't.

## Take a moment to list out the categorical variables
- If we're doing classification, our target variable will be *categorical*
- List out our continuous variables
- If we're comparing continuous to continuous, we use a correlation test
    - Example: total_bill to tip
    - Example: fare to age
- If we're comparing categorical to categorical, we use a $\chi^2$ test
    - Example: pclass to sex
    - Example: day to smoker (in the tips dataset)
- If we're comparing a continuous variable across categories, we use a t-test (two groups at a time)
    - Example: comparing total_bill between smokers and nonsmokers
    - Example: comparing fare between pclass on the Titanic dataset
    - Example: comparing age between pclass on the Titanic dataset
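The test choices listed above map onto SciPy functions roughly as follows. This is a sketch on randomly generated data, so the statistics themselves carry no meaning. The variable names, the seed, and the 0.05 significance level are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05  # assumed significance level

# Continuous vs. continuous: Pearson correlation
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)
r, p_corr = stats.pearsonr(x, y)

# Categorical vs. categorical: chi-squared test on a table of observed counts
group_a = rng.choice(["a", "b"], size=100)
group_b = rng.choice(["yes", "no"], size=100)
observed = pd.crosstab(group_a, group_b)
chi2, p_chi2, dof, expected = stats.chi2_contingency(observed)

# Continuous across two categories: independent two-sample t-test
sample_1 = rng.normal(loc=30, scale=10, size=100)
sample_2 = rng.normal(loc=30, scale=10, size=100)
t, p_ttest = stats.ttest_ind(sample_1, sample_2)

# Same decision rule for every test: compare the p-value to alpha
for name, p in [("pearsonr", p_corr), ("chi2", p_chi2), ("ttest", p_ttest)]:
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: p={p:.3f} -> {decision}")
```

In each case the null hypothesis is "no relationship" or "no difference", and a p-value below the chosen `alpha` is taken as evidence against it.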