<a href="https://colab.research.google.com/github/RubeRad/tcscs/blob/master/Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Categorical Plotting with Seaborn

We've had a fair bit of relational line and scatter plotting with 2D numerical data, but what about when data is not numerical but **categorical**, like male/female, public/charter/private, own/rent, ...? 

**Seaborn** is a popular graphing package built on top of matplotlib, which offers a suite of attractive, data-rich chart styles, including `catplot`. [This link is worth bookmarking for reference](https://seaborn.pydata.org/tutorial/categorical.html#categorical-tutorial). 

Let's import libraries and load up a dataset about the Titanic.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas  as pd
import seaborn as sns # sns='Sam Norman Seaborn'
# Sam Seaborn was Rob Lowe's character in 'West Wing'
# The creator of seaborn is a huge fan of WW and created other modules
# named after other characters (lyman, moss, https://github.com/mwaskom)

In [None]:
# This is a famous dataset of Titanic passenger statistics
t3 = pd.read_csv('https://raw.githubusercontent.com/RubeRad/camcom/master/titanic3.csv')
t3.head()

Take a moment to read those first 5 rows.

Row 0 is an unmarried woman, 29 years old, traveling alone, 1st class, who survived.

The next 4 rows are a family of 4, the Allisons: 2 parents, a 2-year-old daughter, and a son under 1 year old (all traveling with #siblings/spouses=1 and #parents/children=2). Only the baby boy survived.

There are a couple ways to get some total survival statistics:

In [None]:
t3.survived.value_counts()

In [None]:
survived = t3.survived.sum()   # sum of 500 ones (and 809 zeroes)
total    = t3.survived.count() # 1309
died   = total - survived      # 809
diepct =     (died/total)*100
livpct = (survived/total)*100

In [None]:
diepct
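
As a cross-check, pandas can produce the same kind of percentages directly. This is a minimal sketch on a small made-up 0/1 Series (`survived_demo` is a hypothetical stand-in, not the real data) so that it runs on its own. With the real data, the same calls on `t3.survived` should agree with `diepct` and `livpct` above.

```python
import pandas as pd

# Hypothetical stand-in for t3.survived: 3 survivors among 8 passengers
survived_demo = pd.Series([1, 0, 0, 1, 0, 0, 1, 0], name='survived')

# normalize=True returns fractions of the total instead of raw counts
pcts = survived_demo.value_counts(normalize=True) * 100
diepct_demo = pcts.loc[0]   # percent of rows where survived == 0
livpct_demo = pcts.loc[1]   # percent of rows where survived == 1

# The mean of a 0/1 column is the fraction of ones
mean_pct = survived_demo.mean() * 100
```
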

In [None]:
plt.figure()
plt.gca().set_ylabel('Percent')
plt.bar(  ['Died','Survived'],   # array of categories
          [diepct, livpct]    )  # array of bar heights

#plt.text(0.5, 35, 'LAME', ha='center', va='center',
#        rotation=45, c='r', size=75, fontdict={'weight':'bold'} )

plt.show()

## Tufte Time
How many pieces of data are displayed in that bar chart? What is the data-ink ratio? This is why Tufte has a generally low opinion of bar charts.

Go back, uncomment that `plt.text()` call, and recreate the graph.

## Seaborn `catplot`
Let's examine Seaborn's categorical plotting on another dataset and come back to apply what we've learned to the Titanic.

We use the dataset 'Tips', which is an educational standard. Who are better tippers -- men/women? smokers/non? Weekday/weekend/lunch/dinner diners? Small/large parties?

In [None]:
tips = sns.load_dataset('tips')
tips.head()

Let's add another column:

In [None]:
# notice for creating a new column we need ['']
tips['pct'] = tips.tip / tips.total_bill * 100  
tips.info()

In [None]:
# once the column exists we can use the simpler dataframe.column notation
tips.pct.describe()

## Exercise 0
Let's use matplotlib to make a graph to see if Male or Female customers are better tippers!

Problem is, we need numbers in order to graph.

Run the cell as-is, then uncomment one extra command at a time to see what it does.

In [None]:
tips['sex01'] = (tips.sex=='Female').astype(int) # Males --> 0; Females --> 1
#tips['sex01'] += np.random.uniform(-0.2, 0.2, len(tips)) # jitter: one random offset per row

plt.figure()
ax = plt.gca()

ax.plot(tips.sex01, tips.pct, ls='', marker='.')

#ax.set_xticks([0,1])
#ax.set_xlim(-0.5,1.5)
#ax.set_xticklabels(['Male', 'Female'])

plt.show()


# Seaborn

That's a lot of work! Seaborn was designed to make that kind of thing a lot easier.

Seaborn's `catplot` takes categorical data as the x axis, and numerical data as the y axis (or the other way around!).

This is the most basic possible use:

In [None]:
sns.catplot(data=tips, x='sex', y='pct')   
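
`catplot` returns a seaborn `FacetGrid` object, which is handy for tweaking the graph afterwards. Here is a minimal sketch on a tiny made-up DataFrame (`toy` and its values are assumptions for illustration, not the real Tips data):

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for the tips DataFrame
toy = pd.DataFrame({
    'sex': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'pct': [15.0, 18.0, 12.5, 20.0, 16.0, 14.0],
})

# Keep the returned FacetGrid so the graph can be adjusted
g = sns.catplot(data=toy, x='sex', y='pct')
g.set_axis_labels('Customer sex', 'Tip (% of bill)')
g.ax.set_title('Toy tip percentages')   # g.ax is the single Axes of this grid
```

The same pattern should work with `data=tips` in place of `toy`.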

## Exercise 1: different kinds
The default kind of catplot is `kind='strip'`. Try regraphing with other kinds: `'swarm'`, `'box'`, `'boxen'`, `'violin'`, `'point'`, `'bar'`, `'count'`.
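
Some kinds plot every data point, while others plot a summary. By default `'bar'` and `'point'` show the mean of each category, and `'count'` shows the number of rows in each category. (Because `'count'` only counts rows, it expects just one of `x` or `y`, and seaborn will likely complain if it gets both.) The sketch below computes those summaries with plain pandas on made-up data (`toy` is hypothetical), as a way to check what the bars mean:

```python
import pandas as pd

# Made-up stand-in for the tips DataFrame
toy = pd.DataFrame({
    'sex': ['Male', 'Male', 'Male', 'Female', 'Female'],
    'pct': [15.0, 12.0, 18.0, 20.0, 16.0],
})

# What kind='bar' draws by default: the mean of y within each x category
bar_heights = toy.groupby('sex')['pct'].mean()

# What kind='count' draws: how many rows fall in each category
count_heights = toy['sex'].value_counts()
```
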

## Exercise 2: Swap x and y
Seaborn automatically decides if x or y is categorical/numerical and how to handle it.

After noticing that swapping x and y works, put it back.

## Exercise 3: Use hue to display another dimension
It's usually better to display more dimensions of data, as long as the graph stays readable. Seaborn can include a third dimension by varying color with `hue`. Try adding `hue='day'` or `'smoker'` or `'time'`.

Also test using `dodge=True`
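
For reference, here is a minimal sketch of `hue` and `dodge` together, on made-up data (`toy` is hypothetical). With `dodge=True`, the hue levels are drawn side by side within each x category instead of on top of each other:

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for the tips DataFrame
toy = pd.DataFrame({
    'sex':    ['Male', 'Male', 'Male', 'Male',
               'Female', 'Female', 'Female', 'Female'],
    'smoker': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No'],
    'pct':    [15.0, 12.5, 16.0, 17.5, 18.0, 20.0, 14.0, 19.0],
})

# hue colors the points by smoker; dodge separates the two colors sideways
g = sns.catplot(data=toy, x='sex', y='pct', hue='smoker', dodge=True)
```
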

## Exercise 4: Split graphs along col to display another dimension
Try adding `col='day'` or `'smoker'` or `'time'` (whatever you didn't use for hue)

## Exercise 5: Split graphs along row to display another dimension
Try adding `row='day'` or `'smoker'` or `'time'` (whatever you didn't use for hue or col)
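
With `row` and `col`, seaborn draws one small graph per combination of category values. A minimal sketch on made-up data (`toy` is hypothetical) shows how the grid of graphs is shaped:

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for the tips DataFrame, covering every smoker/time combination
toy = pd.DataFrame({
    'sex':    ['Male', 'Female'] * 4,
    'smoker': ['Yes', 'Yes', 'No', 'No'] * 2,
    'time':   ['Lunch'] * 4 + ['Dinner'] * 4,
    'pct':    [15.0, 18.0, 12.5, 20.0, 16.0, 14.0, 17.5, 19.0],
})

# One row of graphs per smoker value, one column of graphs per time value
g = sns.catplot(data=toy, x='sex', y='pct',
                row='smoker', col='time', height=2.5)

# g.axes is a 2D array of matplotlib Axes: (number of rows, number of cols)
n_rows, n_cols = g.axes.shape
```
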

## (Not a) Competition: whose graph looks best?
Which choices reveal the most information from the dataset?

If using all of x,y,hue,col,row is too much clutter, what would be best removed?

# Back to the Titanic

Just like we made an additional useful column `pct` in Tips, let's make an additional useful column in Titanic. The column will be named `who`, and its values will be `Man`, `Woman`, and `Child`. Anybody age 16 or under will be a `Child`; anybody over 16 will be a `Man` or `Woman`, depending on the value of the `sex` column.

In [None]:
# Start out creating the new column populated entirely as 'Man'
t3['who'] = 'Man'

In [None]:
t3.head()

In [None]:
t3['who'].value_counts()

The pandas DataFrame indexer `loc` (short for 'location') lets us set only some entries of a DataFrame. Inside the `[]` are two things separated by a comma: **which rows** you want to set, and **which column** you want to set.

In [None]:
which_rows = (t3.sex=='female') # boolean Series (True/False per row) to select rows
which_col  = 'who'              # this is the new column we want to set values for

In [None]:
which_rows

In [None]:
t3.loc[ which_rows, which_col ] = 'Woman'

In [None]:
t3.head()

In [None]:
t3['who'].value_counts()

In [None]:
# but we don't have to make temporary variables like which_rows, which_col
# we can just do it in one statement:

#          rows       column     value
#        vvvvvvvvvv    vvv       vvvvv
t3.loc[  t3.age<=16,  'who' ] = 'Child'
t3.head()

In [None]:
t3['who'].value_counts()
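
The three `loc` steps above can also be written as one expression. This is an alternative sketch using `np.select`, which is not what the notebook uses; the DataFrame `demo` is made up for illustration. Note that a missing age (NaN) compares as False against `<= 16`, so with either approach those passengers stay `Man` or `Woman`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a few Titanic rows, including one missing age
demo = pd.DataFrame({
    'sex': ['female', 'male', 'male', 'female', 'male'],
    'age': [29.0, 0.9, 30.0, 2.0, np.nan],
})

# np.select takes the first condition that is True for each row
conditions = [demo.age <= 16, demo.sex == 'female']
choices    = ['Child',        'Woman']
demo['who'] = np.select(conditions, choices, default='Man')
```
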

## Titanic exercise: choose data columns
Look at those columns. A few of them can be plotted numerically (age, fare), and many are interesting categorically (survived, pclass, sex, who).

Use `sns.catplot` and choose Series for x, y, hue, row and/or col, as well as a plot kind, and optionally dodge=True.

Which configuration does the best job revealing a pattern of who is most likely to survive?

In [None]:
# Available columns to choose for x, y, and col
# 'sex'      : categorical male/female
# 'who'      : categorical Man/Woman/Child
# 'age'      : numerical years old
# 'pclass'   : passenger class 1/2/3
# 'fare'     : numerical ticket price
# 'survived' : categorical 0/1

# Choose a catplot kind (strip, swarm, box, boxen, violin, ...),
# and choose columns for x, y, and col,
# and consider whether dodge=True is better or not

sns.catplot(data=t3,
            kind='',
            x='', 
            y='', 
            col='',
            hue='survived', palette={0:'r',1:'g'},  # red for died, green for survived
            dodge=False)
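
Whatever graph you settle on, it can help to check it against numbers. Because `survived` is 0/1, its mean within a group is that group's survival rate. The sketch below uses made-up rows (`toy` is hypothetical, and its rates mean nothing); the same `groupby` on `t3` would give the real rates.

```python
import pandas as pd

# Made-up stand-in for the Titanic DataFrame
toy = pd.DataFrame({
    'who':      ['Man', 'Man', 'Man', 'Woman', 'Woman',
                 'Child', 'Child', 'Man', 'Woman'],
    'pclass':   [1, 1, 3, 1, 3, 1, 3, 3, 1],
    'survived': [1, 0, 0, 1, 0, 1, 1, 0, 1],
})

# Mean of a 0/1 column = fraction of ones; unstack puts pclass across the top
rates = toy.groupby(['who', 'pclass'])['survived'].mean().unstack('pclass') * 100
```
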

# Optional: Graphing police shootings
This is one of the CORGIS links to a CSV rendering of the Washington Post national database of fatal police shootings.

In [None]:
wapoALL = pd.read_csv('https://corgis-edu.github.io/corgis/datasets/csv/police_shootings/police_shootings.csv')
wapoALL.head()

Note that in this dataset all of the column names have `.` in them, so we can't use the shortcut `dataframe.column` notation. We need the full `wapoALL['Column.Name']` notation all the time. `:-(`

In [None]:
# Just for illustration, grab a slice DataFrame for just 2016 incidents
wapo = wapoALL[ wapoALL['Incident.Date.Year']==2016 ]
wapo
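
The same True/False row selection extends to ranges and to several conditions at once. Here is a minimal sketch on a made-up DataFrame (`toy` and its column names are hypothetical). Each comparison needs its own parentheses, because `&` binds more tightly than `>=` or `==` in Python.

```python
import pandas as pd

# Made-up stand-in with hypothetical column names
toy = pd.DataFrame({
    'year':  [2015, 2016, 2016, 2017, 2018],
    'armed': ['gun', 'knife', 'gun', 'unarmed', 'gun'],
})

# Two comparisons joined with & (and); note the parentheses around each one
in_range = (toy['year'] >= 2016) & (toy['year'] <= 2017)

# between() does the same job; both ends are included by default
same_range = toy['year'].between(2016, 2017)

# Masks can be combined further before slicing
subset = toy[in_range & (toy['armed'] == 'gun')]
```
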

Before trying this graph out, what do you expect it will look like?

In [None]:
sns.catplot(data=wapo, x='Factors.Armed', y='Person.Age',
            kind='swarm', s=3)

What went wrong up there?

In [None]:
# What are all the ways the victims were armed?
wapo['Factors.Armed'].value_counts()

In [None]:
# Let's grab just the most common

# note the common pattern of 
# wapo4 = wapo[            ...stuff-that-makes-a-list-of-T/F-to-select-rows...          ]
# but the stuff, instead of being like
#                        df['column'] <= value
# this time uses the "isin" function

wapo4 = wapo[  wapo['Factors.Armed'].isin( ['gun', 'knife', 'toy weapon', 'unarmed'] )  ]
wapo4['Factors.Armed'].value_counts()
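
Instead of typing the category names by hand, the most common ones can be pulled out of `value_counts()`, which sorts from most to least frequent. This is a sketch on a made-up Series (`armed_demo` is hypothetical); nothing here says which categories are most common in the real dataset.

```python
import pandas as pd

# Made-up stand-in for a categorical column
armed_demo = pd.Series(['gun', 'gun', 'gun', 'knife', 'knife',
                        'unarmed', 'gun', 'toy weapon', 'vehicle', 'knife'])

# value_counts() is sorted descending, so head(2) keeps the 2 most common;
# .index gives the category names themselves
top2 = armed_demo.value_counts().head(2).index

# isin() accepts that Index just like a hand-typed list
subset = armed_demo[armed_demo.isin(top2)]
```
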

In [None]:
sns.catplot(data=wapo4, x='Factors.Armed', y='Person.Age',
            kind='swarm', s=3)

In [None]:
wapo4['Person.Race'].value_counts()

## Police Shootings Exercises
Start over from the full DataFrame `wapoALL`.

Using the example above of four weapon types, slice `wapoALL` (repeatedly) to get a smaller DataFrame with just years 2019-2022, just those four weapon types from above, and just the three common Person.Race categories.

Then use `sns.catplot()` to make a graph that displays all those dimensions of data.

In [None]:
# what condition goes in ... to select rows with Incident.Date.Year from 2019-2022?
wapoY = wapoALL[   ...   ]

In [None]:
# see the example above to slice out the rows with the same four weapon types
wapoYW = wapoY[   ...   ]

In [None]:
# slice similarly to select only rows where Person.Race is 'White', 'African American', or 'Hispanic'
wapoYWR = wapoYW[   ...   ]

In [None]:
# Use sns.catplot() to graph the data in wapoYWR, 
# displaying all of those dimensions of the data
