### Interaction of Categorical with Quantitative Features

**Goal:**
Explore the interaction of each categorical feature with each quantitative feature. 
Ideally, I would like to make this process as repeatable as possible, given I will likely want to do this with every project. So, given a list of quant vars and a list of categorical vars, I want to be able to plot each categorical var by each quant var. That sounds complicated! How do I get there? 

Take it one small step at a time. 

1. Create one plot with a single categorical variable (x) and a single quantitative variable (y)

2. Loop through each categorical variable to create a plot of each against a single quantitative variable. 

3. Add an external loop: loop through each quantitative variable, which will be plotted with each categorical variable. 

4. *optional:* Account for too many categorical variables to fit on a single row. 

I already have lists: `cat_vars`, `quant_vars`. 

In [None]:
print(cat_vars)


print(quant_vars)
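If the lists do not exist yet in a new project, one rough way to build them is from the column dtypes. This is only a sketch: the helper name `split_vars`, the `max_unique` threshold, and the toy DataFrame are assumptions for illustration, not part of this notebook.

```python
import pandas as pd


def split_vars(df, target=None, max_unique=10):
    """Heuristically split column names into categorical and quantitative lists."""
    cat_vars, quant_vars = [], []
    for col in df.columns:
        if col == target:
            continue  # leave the target out of both lists
        is_numeric = pd.api.types.is_numeric_dtype(df[col])
        is_bool = pd.api.types.is_bool_dtype(df[col])
        # numeric columns with many distinct values are treated as quantitative
        if is_numeric and not is_bool and df[col].nunique() > max_unique:
            quant_vars.append(col)
        else:
            cat_vars.append(col)
    return cat_vars, quant_vars


# toy data standing in for a real training set
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07],
    "pclass": [3, 1, 3, 1, 1, 3],
    "sex": ["male", "female", "female", "female", "male", "male"],
})
cat_demo, quant_demo = split_vars(toy, max_unique=3)
print(cat_demo, quant_demo)
```

The threshold is a judgment call per dataset, so it is worth checking the resulting lists by eye before plotting.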

So let's start with step 1, a single plot. I will take the first item of each list as my variables to plot. 

In [None]:
cat = cat_vars[0]
quant = quant_vars[0]

In [None]:
# 1. Create one plot with a single categorical variable (x) and 
# a single quantitative variable (y)

plt.figure(figsize=(4,3))
sns.swarmplot(x=cat, y=quant, data=train, color='lightseagreen')
plt.show()

Step 2: Loop through each categorical variable to create a plot of each against a single quantitative variable. I will use the same quant variable, `quant_vars[0]` = 'age'. 

The number of columns will be the number of categorical variables. *NOTE: if the number of categorical variables gets much higher than a handful, we will need to adjust this to create multiple rows.*

In [None]:
quant = quant_vars[0]

# one column per categorical variable, so get number of cat_vars
cols = len(cat_vars)
_, ax = plt.subplots(nrows=1, ncols=cols, figsize=(16, 4), sharey=False)

# loop through each categorical variable and create a swarmplot against the quant variable
# (x and y are passed as keywords; recent seaborn versions no longer accept them positionally)
for i, cat in enumerate(cat_vars):
    sns.swarmplot(x=cat, y=quant, data=train, ax=ax[i], color='lightseagreen')
    ax[i].set_xlabel(cat)
    ax[i].set_ylabel(quant)
    ax[i].set_title(cat)

3. Add an external loop: loop through each quantitative variable, which will be plotted with each categorical variable. 

In [None]:

# one row per quant var, so we will plot after each run through the outer loop
# one column per categorical variable, so get number of cat_vars
cols = len(cat_vars)

for quant in quant_vars:
    _, ax = plt.subplots(nrows=1, ncols=cols, figsize=(16, 4), sharey=False)
    for i, cat in enumerate(cat_vars):
        sns.swarmplot(x=cat, y=quant, data=train, ax=ax[i], color='lightseagreen')
        ax[i].set_xlabel('')
        ax[i].set_ylabel(quant)
        ax[i].set_title(cat)
    plt.show()

**A little bonus learning**

*optional:* Account for too many categorical variables to fit on a single row. 

Remember in 6th(?) grade math when you were given a table and needed to observe the patterns and come up with the next items? And then in 7th grade(?) you had to come up with the formula that would take in the 'x' and produce the 'y' given that table? Or maybe you just remember the term, "Systems of Equations". Well, here is a direct use case of systems of equations in "real life". 

The number of columns will be the number of categorical variables, but if the number of categorical variables gets much higher than a handful, we will want to adjust this to create multiple rows.

So, let's say the max in one row is 5, so the max cols will be 5. 
That means if the number of categorical variables is > 5, we will want to add another row. Let's think through some scenarios to find the pattern:


| vars | cols | rows | 
|------|------|------|
| 1    | 1    | 1    |
| 5    | 5    | 1    |
| 7    | 5    | 2    |
| 10   | 5    | 2    |
| 12   | 5    | 3    |


- cols-formula: min(vars,5) 
- rows-formula: ceiling(vars/5)

| vars | cols-formula       | rows-formula     | 
|------|--------------------|------------------|
| 1    | min(vars,5) = 1    | ceil(vars/5) = 1 |
| 5    | min(vars,5) = 5    | ceil(vars/5) = 1 |
| 7    | min(vars,5) = 5    | ceil(vars/5) = 2 |
| 10   | min(vars,5) = 5    | ceil(vars/5) = 2 |
| 12   | min(vars,5) = 5    | ceil(vars/5) = 3 |
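The two formulas can be checked against the table with a small helper. This is a sketch; the function name `grid_shape` and the `max_cols` parameter are made up here for illustration.

```python
import math


def grid_shape(n_vars, max_cols=5):
    """Return (rows, cols) for a subplot grid holding n_vars plots."""
    cols = min(n_vars, max_cols)         # never more than max_cols per row
    rows = math.ceil(n_vars / max_cols)  # a new row for every started group of max_cols
    return rows, cols


# the same scenarios as in the table above
for n in (1, 5, 7, 10, 12):
    print(n, grid_shape(n))
```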


In [None]:
# at most 5 columns per row of plots
# number of rows needed to fit all cat_vars at 5 per row
import math

cols = min(len(cat_vars), 5)
rows = math.ceil(len(cat_vars)/5)

rows, cols

In [None]:
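# For each quant variable, create the grid of rows and cols computed above,
# plotting each categorical variable in its own cell.
# Show the plots, then go to the next quant variable.

for quant in quant_vars:
    # squeeze=False keeps ax 2-D even when rows or cols is 1, so flatten() always works
    _, ax = plt.subplots(nrows=rows, ncols=cols, figsize=(16, 4 * rows), sharey=False, squeeze=False)
    ax = ax.flatten()
    for i, cat in enumerate(cat_vars):
        sns.swarmplot(x=cat, y=quant, data=train, ax=ax[i], color='lightseagreen')
        ax[i].set_xlabel('')
        ax[i].set_ylabel(quant)
        ax[i].set_title(cat)
    # hide any unused cells in the last row
    for extra in ax[len(cat_vars):]:
        extra.set_visible(False)
    plt.show()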
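Since the goal was to make this repeatable across projects, the loops can be wrapped in a function. What follows is a minimal sketch, not a finished utility: the name `plot_cat_by_quant`, the `max_cols` parameter, and the toy DataFrame are assumptions for illustration.

```python
import math

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_cat_by_quant(df, cat_vars, quant_vars, max_cols=5):
    """Draw one grid of swarmplots per quantitative variable and return the figures."""
    cols = min(len(cat_vars), max_cols)
    rows = math.ceil(len(cat_vars) / max_cols)
    figures = []
    for quant in quant_vars:
        # squeeze=False always gives a 2-D array of axes, even for a single row
        fig, axes = plt.subplots(nrows=rows, ncols=cols,
                                 figsize=(16, 4 * rows), squeeze=False)
        flat = axes.flatten()
        for ax, cat in zip(flat, cat_vars):
            sns.swarmplot(x=cat, y=quant, data=df, ax=ax, color="lightseagreen")
            ax.set_xlabel("")
            ax.set_ylabel(quant)
            ax.set_title(cat)
        # hide grid cells left over when the last row is not full
        for ax in flat[len(cat_vars):]:
            ax.set_visible(False)
        figures.append(fig)
    return figures


# toy data standing in for `train`
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07],
    "pclass": [3, 1, 3, 1, 1, 3],
    "sex": ["male", "female", "female", "female", "male", "male"],
})
figs = plot_cat_by_quant(toy, ["pclass", "sex"], ["age", "fare"])
```

Returning the figures rather than calling `plt.show()` inside the function leaves it up to the caller whether to display or save them.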

**Takeaways**

One thing we could do to remove the interdependence between fare and class is to normalize the fare within each class. In other words, does the fact that a passenger paid the top price within first class matter? Does that passenger have a higher chance of survival than the passenger who paid the lowest rate in third class? 

I will not act on this right now, but I want to document it for later. 
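For later reference, one possible way to normalize within a group is a per-class z-score using `groupby(...).transform(...)`. This is a sketch on toy data; the column names `pclass` and `fare` and the new column `fare_z_in_class` are assumptions, and a z-score is only one of several reasonable choices.

```python
import pandas as pd

# toy data: two classes with very different fare levels
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "fare": [50.0, 80.0, 110.0, 5.0, 8.0, 11.0],
})

# transform() returns a value per row, aligned with the original index
by_class = toy.groupby("pclass")["fare"]
toy["fare_z_in_class"] = (
    (toy["fare"] - by_class.transform("mean")) / by_class.transform("std")
)
print(toy)
```

After this step, the cheapest ticket in each class gets the same score, so the new column describes position within a class rather than the class itself.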

## Reference

### Single Variable Distribution Plots 

#### Histogram

*matplotlib*:  plt.hist(df.x_var)

*seaborn*: sns.histplot(data=df, x=x_var, *hue=color_var*)

*pandas*: df.x_var.value_counts().plot.bar()

#### Boxplot

*matplotlib*:  plt.boxplot(df.y_var) 

*seaborn*: sns.boxplot(*x=x_var,* y=y_var, *hue=color_var*, data=df)


### Continuous x Continuous

Here, continuous really means quantitative, not necessarily continuous in the strict mathematical sense. 

#### Pairplot

sns.pairplot(data, hue=cat_var, vars=\[var1, var2, ...\], kind=\{'scatter', 'kde', 'hist', 'reg'\}, diag_kind=\{'auto', 'hist', 'kde', None\})

#### Relplot

- x = continuous variable (or temporal)
- y = continuous variable 

*optional*

- hue = categorical_var (color)
- size = continuous_var (width of line, size of dot)
- style = categorical_var (style of line, style of dot)
- col = categorical_var (for relplot only, split charts into columns by category) 
- row = categorical_var (for relplot only, split charts into rows by category)

Types

- scatterplot(), relplot(kind="scatter")
- lineplot(), relplot(kind="line")

sns.relplot(x=x_var, y=y_var, data=df, kind=\{"line", "scatter"\}, hue=cat_var, col=cat_var)

#### Heatmap

sns.heatmap(data=df\[\['col1', 'col2'\]\], annot=False, cmap= fmt=
pd.crosstab(train.embark_town, train.survived, margins=True, normalize=True)


### Categorical x Continuous

#### Catplot

- x = categorical_var
- y = continuous_var 

*optional*

- hue = categorical_var (color) 
- size = continuous_var (for categorical scatterplots, size of dots)
- style = categorical_var (for categorical scatterplots, style of dots)
- col = categorical_var (for catplots only, split charts into columns by category) 
- row = categorical_var (for catplots only, split charts into rows by category)

Categorical Scatterplots

- stripplot(), catplot(kind="strip")
- swarmplot(), catplot(kind="swarm")

Categorical Distribution Plots

- boxplot(), catplot(kind="box")
- violinplot(), catplot(kind="violin")
- boxenplot(), catplot(kind="boxen")

Categorical Estimate Plots

- pointplot(), catplot(kind="point")
- barplot(), catplot(kind="bar")
- countplot(), catplot(kind="count")


sns.catplot(x=cat_var, y=cont_var, col=column_cat_var, row=row_cat_var, kind=

### Categorical x Categorical


#### Catplot

If one of the variables is boolean, then the catplot is useful. The y-axis will represent the boolean variable and the range will be 0 to 1. 
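As an illustration, with a 0/1 column on the y-axis a bar catplot shows the mean of that column per category, which is the proportion of 1s. This is a sketch on toy data; the column names `sex` and `survived` are assumed.

```python
import pandas as pd
import seaborn as sns

# toy data: a categorical column and a 0/1 column
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female", "male"],
    "survived": [0, 1, 1, 0, 0, 1],
})

# kind="bar" plots the mean of y per category (bar height = share of 1s)
g = sns.catplot(x="sex", y="survived", data=toy, kind="bar")
g.set(ylim=(0, 1))  # pin the axis to the full 0-to-1 range
```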



#### Heatmap

A heatmap can be used with two categorical variables. You can plot proportions by creating a crosstab and then plotting the heatmap from the crosstab. 
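A rough sketch of that crosstab-then-heatmap approach, using toy data in place of `train`. The choices of `normalize="index"`, the colormap, and the annotation format are illustrative assumptions.

```python
import pandas as pd
import seaborn as sns

# toy data standing in for train.embark_town and train.survived
toy = pd.DataFrame({
    "embark_town": ["Southampton", "Cherbourg", "Southampton",
                    "Queenstown", "Cherbourg", "Southampton"],
    "survived": [0, 1, 1, 0, 1, 0],
})

# normalize="index" turns counts into row proportions (each row sums to 1)
ctab = pd.crosstab(toy["embark_town"], toy["survived"], normalize="index")

# annot=True writes the proportion in each cell; fmt controls its formatting
ax = sns.heatmap(ctab, annot=True, fmt=".0%", cmap="Blues", vmin=0, vmax=1)
```

Using `normalize=True` instead would give proportions of the whole table rather than of each row.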

To control the columns and rows of a subplot grid yourself:

In [None]:

# If you have n features and want to plot every unique pair of them
import itertools

# Get our list of continuous features
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# itertools.combinations generates each unique pair of elements
# We'll produce a list of each pair, where each pair is a tuple (like a list)
pairs = itertools.combinations(features, r=2)
pairs = list(pairs)
pairs

In [None]:
# Define our number of rows and columns beforehand
nrows = 2
ncols = 3

# Setup the subplots
_, ax = plt.subplots(nrows=nrows, ncols=ncols, figsize=(16, 6), sharey=True)

for i, pair in enumerate(pairs):

    # map the flat index i to a (row, column) position in the grid
    row, column = divmod(i, ncols)
    
    sns.scatterplot(x=pair[0], y=pair[1], data=df, hue="species", ax=ax[row][column])