# Categorical Plotting with Seaborn

We've done a fair bit of relational plotting (lines and scatters) with 2D numerical data. But what about data that is not numerical but **categorical**, like male/female, public/charter/private, or own/rent?

**Seaborn** is a popular graphing package built on top of matplotlib. It offers a suite of attractive, data-rich chart styles, including `catplot`. [This link is worth bookmarking for reference](https://seaborn.pydata.org/tutorial/categorical.html#categorical-tutorial).

Let's import libraries and load up a dataset about the Titanic.

In [None]:
import matplotlib.pyplot as plt
import pandas  as pd
import seaborn as sns  # sns = 'Sam Norman Seaborn'
# the creator of seaborn is a huge fan of the TV show 'The West Wing'

t3 = pd.read_csv('https://raw.githubusercontent.com/RubeRad/camcom/master/titanic3.csv')
t3.info()

In [None]:
t3.head()

Take a moment to read those first 5 rows.

Row 0 is an unmarried woman, 29 years old, traveling alone, 1st class, who survived.

The next 4 rows are a family of 4, the Allisons: 2 parents, a 2-year-old daughter, and a son under 1 year old (each traveling with #siblings/spouses=1 and #parents/children=2). Only the baby boy survived.

There are a couple of ways to get some total survival statistics:

In [None]:
t3['survived'].value_counts()

In [None]:
survived = t3['survived'].sum()   # sum of 500 ones (and 809 zeroes)
total    = t3['survived'].count() # 1309
died     = total - survived        # 809
diepct   = (died / total) * 100      # percent who died
livpct   = (survived / total) * 100  # percent who survived
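
Because `survived` holds only 0s and 1s, the same percentages can also be had in one step. Here is a minimal sketch on a made-up five-passenger Series (the name `toy` and its values are assumptions for illustration, not Titanic data):

```python
import pandas as pd

# Made-up 0/1 survival flags, standing in for t3['survived']
toy = pd.Series([1, 0, 0, 1, 0], name='survived')

# normalize=True returns fractions instead of counts
pcts = toy.value_counts(normalize=True) * 100

# The mean of a 0/1 column is the fraction of ones
livpct_toy = toy.mean() * 100

print(pcts)
print(livpct_toy)
```

On the real data, `t3['survived'].mean() * 100` should agree with `livpct` computed above.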

In [None]:
plt.figure()
plt.gca().set_ylabel('Percent')
plt.bar(  ['Died','Survived'],   # array of categories
          [diepct, livpct]    )  # array of bar heights
plt.show()

## Tufte Time
How many pieces of data are displayed in that bar chart? (Just two numbers.) What is the data-ink ratio? This is why Tufte has a generally low opinion of bar charts.

In [None]:
plt.figure()
plt.gca().set_ylabel('Percent')
plt.bar(['Died','Survived'], [diepct, livpct])
plt.text(0.5, 35, 'LAME', ha='center', va='center',
         rotation=45, c='r', size=75, fontdict={'weight':'bold'} )
plt.show()

## Seaborn `catplot`
Let's examine Seaborn's categorical plotting on another dataset and come back to apply what we've learned to the Titanic.

We'll use the 'tips' dataset, a standard teaching example. Who are better tippers: men or women? Smokers or non-smokers? Weekday or weekend diners, at lunch or at dinner? Small or large parties?

In [None]:
tips = sns.load_dataset('tips')
tips.info()

In [None]:
tips.head()

Let's add another column, the tip as a percentage of the total bill:

In [None]:
tips['pct'] = tips['tip'] / tips['total_bill'] * 100
tips.info()
tips['pct'].describe()

Seaborn's `catplot` takes categorical data as the x axis, and numerical data as the y axis (or the other way around!)


In [None]:
sns.catplot(x='sex', y='pct', data=tips, kind='swarm')
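
One detail worth knowing before experimenting: `catplot` returns a seaborn `FacetGrid` rather than a plain matplotlib Axes, and that object can be used to tidy up the chart. A small sketch, using a made-up stand-in for the tips data (the frame `toy` and its values are assumptions for illustration only):

```python
import pandas as pd
import seaborn as sns

# Tiny made-up stand-in for the tips data
toy = pd.DataFrame({
    'sex': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'pct': [15.0, 18.5, 12.0, 20.0, 16.5, 14.0],
})

# Keep the return value: it is a FacetGrid wrapping the figure
g = sns.catplot(x='sex', y='pct', data=toy, kind='strip')

# Relabel the axes through the grid object
g.set_axis_labels('Sex', 'Tip (% of bill)')
```

The same pattern should work unchanged with `data=tips`.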

## Exercise
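The default kind of catplot is 'strip'; the cell above asked for 'swarm' instead. Try regraphing with other kinds: 'strip', 'box', 'boxen', 'violin', 'point', 'bar', 'count'. (Note that 'count' tallies rows per category, so it takes only one of x or y.)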

Also try swapping x and y.

## Another dimension
Displaying more dimensions of the data in one chart usually tells a richer story. Seaborn can include a third dimension by varying color with `hue`:

In [None]:
sns.catplot(x='time', y='pct', hue='day', data=tips, kind='swarm')

## Exercise
* Try different kinds (strip, swarm, box, boxen, violin)
* Try adding the option dodge=True
* Try swapping x and hue

Which combination is most informative?

## Another dimension
Seaborn can include a fourth and even a fifth dimension by placing multiple plots side by side, in columns (`col`) and rows (`row`):

In [None]:
sns.catplot(x='sex', y='pct', hue='day', col='smoker', row='time', data=tips)
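
To see how `row` and `col` lay out the panels, it helps to inspect the grid that comes back. A minimal sketch on made-up data (the frame `toy` and its values are assumptions for illustration; the column names mirror the tips dataset):

```python
import pandas as pd
import seaborn as sns

# Made-up data covering every smoker/time combination
toy = pd.DataFrame({
    'sex':    ['Male', 'Female'] * 4,
    'pct':    [15.0, 18.0, 12.0, 20.0, 16.0, 14.0, 17.0, 13.0],
    'smoker': ['Yes', 'Yes', 'No', 'No'] * 2,
    'time':   ['Lunch'] * 4 + ['Dinner'] * 4,
})

# One panel per (time, smoker) pair
g = sns.catplot(x='sex', y='pct', col='smoker', row='time',
                data=toy, height=2)

# g.axes is a 2D array of Axes: rows follow `row`, columns follow `col`
print(g.axes.shape)  # (2, 2)
```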

## Exercise
* Try using only col='smoker' or only row='time', then try swapping them

# Back to the Titanic

Just like we made an additional useful column `pct` in Tips, let's make an additional useful column in Titanic: `who`, which labels each passenger as a Man, Woman, or Child.

In [None]:
t3.loc[t3.sex=='male', 'who'] = 'Man'
t3['who'].value_counts()

In [None]:
t3.loc[t3.sex=='female', 'who'] = 'Woman'
t3['who'].value_counts()

In [None]:
t3.loc[t3.age<=16, 'who'] = 'Child'
t3['who'].value_counts()
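
The order of those three assignments matters: 'Child' is written last, so it overrides 'Man' or 'Woman' for anyone aged 16 or under, and passengers with a missing age keep their sex-based label because a comparison with NaN is False. The same logic can be sketched in a single step with `np.select`, which picks the first matching condition. The frame `toy` and the `'Unknown'` default are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Made-up passengers, including one with a missing age
toy = pd.DataFrame({
    'sex': ['male', 'female', 'male', 'female', 'female'],
    'age': [40.0, 29.0, 0.9, 2.0, np.nan],
})

# Conditions are checked in order, so the age test goes first
conditions = [toy['age'] <= 16, toy['sex'] == 'male', toy['sex'] == 'female']
choices = ['Child', 'Man', 'Woman']
toy['who'] = np.select(conditions, choices, default='Unknown')

print(toy)
```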

In [None]:
t3.info()

## Exercise
Look at those columns: a few can be plotted numerically (age, fare), and many are interesting categorically (survived, pclass, sex, who).

Use `sns.catplot` and choose columns for x, y, hue, row and/or col, as well as a plot kind, and optionally dodge=True.

Which configuration does the best job revealing a pattern of who is most likely to survive?

In [None]:
sns.catplot(x='who', y='age', hue='survived', data=t3, kind='swarm', col='pclass',
           palette={0:'k',1:'lightgreen'})
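
A quick numeric cross-check can sit alongside whatever chart you settle on. Since `survived` is 0 or 1, its mean within a group is that group's survival rate. A sketch on made-up rows (the frame `toy` and its values are assumptions for illustration, not actual Titanic results):

```python
import pandas as pd

# Made-up passengers with the same column names as t3
toy = pd.DataFrame({
    'who':      ['Man', 'Man', 'Woman', 'Woman', 'Child', 'Child', 'Man', 'Woman'],
    'pclass':   [1, 3, 1, 3, 1, 3, 3, 1],
    'survived': [0, 0, 1, 0, 1, 1, 1, 1],
})

# Mean of the 0/1 flag per group = survival rate per group;
# unstack turns pclass into columns for a compact table
rate = toy.groupby(['who', 'pclass'])['survived'].mean().unstack('pclass')

print(rate)
```

Swapping `toy` for `t3` should give the real table to compare against the plot.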

In [None]:
# Load the CORGIS police shootings dataset
wapoALL = pd.read_csv('https://corgis-edu.github.io/corgis/datasets/csv/police_shootings/police_shootings.csv')
# Keep only incidents from 2016
wapo = wapoALL[wapoALL['Incident.Date.Year']==2016]
# Keep only four of the 'Factors.Armed' categories
wapo4 = wapo[wapo['Factors.Armed'].isin(['gun', 'knife', 'toy weapon', 'unarmed'])]
wapo4['Factors.Armed'].value_counts()

In [None]:
wapo.info()
wapo['Factors.Fleeing'].value_counts()

In [None]:
sns.catplot(x='Factors.Armed', y='Person.Age', data=wapo4, kind='strip')