<a href="https://colab.research.google.com/github/ContextLab/cs-for-psych/blob/master/slides/module_4/pandas_and_sklearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring `pandas` and `scikit-learn`

In this notebook we'll explore some more advanced features of two popular Python libraries:
- [pandas](https://pandas.pydata.org/): the name derives from "**pan**el **da**ta" (and doubles as a play on "Python data analysis").  Pandas is built on top of [numpy](https://numpy.org/) and provides really useful tools for bookkeeping and manipulating data.  If you're familiar with [`data.frame`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame) objects or the [`plyr`](https://www.rdocumentation.org/packages/plyr/versions/1.8.6) or [`dplyr`](https://dplyr.tidyverse.org/) libraries in R, you'll see lots of similarities to Pandas `DataFrame` objects and how to work with them.  As a historical note, the Pandas `DataFrame` was inspired by R dataframes; R's "[tidyverse](https://www.tidyverse.org/)" libraries offer a comparable toolkit for data science.
- [scikit-learn](https://scikit-learn.org/stable/): a machine learning package for Python that implements many of the most widely used algorithms for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.  Scikit-learn is built on top of [numpy](https://numpy.org/), [scipy](https://scipy.org/), and [matplotlib](https://matplotlib.org/stable/index.html), and it works well with [pandas](https://pandas.pydata.org/) DataFrames.

Before digging into either of these libraries, here are some useful orienting guidelines that might help you to think about what each library does:
- Numpy's main feature is that it introduces the `array` datatype.  Arrays are like matrices or tensors: *n*-dimensional tables of numbers.  Conceptually, arrays are organized like nested lists, where the outer-most level corresponds to the first dimension, the second outer-most level corresponds to the second dimension, and so on (where the inner-most level corresponds to the last dimension).  Because numpy automatically converts "array-like" objects into arrays, you can often pass nested lists directly into numpy functions (as though they were arrays) and they'll be treated just like arrays.  Once you've organized your data into one or more arrays (or array-like objects), numpy allows you to apply a wide variety of linear algebra operations to the data.  Numpy's core routines are also written in efficient C code, which means that you can work effectively with very large datasets.
- Pandas's main feature is the introduction of the `DataFrame` datatype.  Conceptually, DataFrames are like Python dictionaries with three keys: "`values`" points to the data (stored as a 2D numpy array); "`index`" points to an array-like object that contains labels for the rows; and "`columns`" points to an array-like object that contains labels for the columns.  (In practice these are attributes of the DataFrame object rather than dictionary keys.)  Once you've created a DataFrame, you can mostly treat it just like a 2D numpy array: many numpy functions "just work" on DataFrames (e.g., it'll usually work fine if you just pass DataFrames directly to numpy functions that are expecting arrays).
- Scikit-learn really provides two general tools:
  - First, the library includes implementations of a wide variety of algorithms for doing the most widely used machine learning tasks.  This typically requires your data to be organized into numpy arrays, pandas DataFrames, and/or lists.  
  - Second, the library includes a general framework for *organizing* code for implementing machine learning algorithms.  This is arguably the most powerful and furthest-reaching contribution of scikit-learn.  For example, nearly every model in scikit-learn is implemented as a Python class with a `fit` method (which takes in a training dataset and trains the given model) plus a `transform` and/or `predict` method (which takes in potentially new data and either projects it through the fitted model or generates predictions from it).  The common structure across models means that it's relatively straightforward to implement new models or features that will then play nicely with the other functionality in the library.
    - Additional related libraries extend the functionality of scikit-learn even further.  For example, [scikit-image](https://scikit-image.org/) adds image processing algorithms; [scikit-network](https://scikit-network.readthedocs.io/en/latest/) adds graph theory algorithms; [scikit-optimize](https://scikit-optimize.github.io/stable/) adds some additional optimization algorithms; and so on.
    - There is some redundancy between scikit-learn and other popular libraries.  For example, scikit-learn includes some deep learning models and tools.  However, most of these implementations are less efficient and less flexible than libraries like [tensorflow](https://www.tensorflow.org/) or [pytorch](https://pytorch.org/) that are focused specifically on deep learning, rather than on "machine learning" in general.
    - A reasonable rule of thumb might be to implement basic ideas using scikit-learn as a way to get things "up and running" on a test dataset or application.  But then if you want to scale things up to a much larger dataset you may want to port things over to another library.
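
To make the three orienting points above concrete, here is a minimal sketch.  The toy values, the row/column labels, and the choice of `StandardScaler` as the example model are all assumptions made purely for illustration; any scikit-learn transformer would show the same `fit`/`transform` pattern.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# A nested list is "array-like": numpy converts it to an array on the fly
nested = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
arr = np.array(nested)
print(arr.shape)        # 3 rows (outer level) by 2 columns (inner level)
print(np.mean(nested))  # numpy functions accept the nested list directly

# A DataFrame bundles the data with row labels and column labels
df = pd.DataFrame(arr, index=['a', 'b', 'c'], columns=['x', 'y'])
print(df.values)   # the underlying 2D numpy array
print(df.index)    # labels for the rows
print(df.columns)  # labels for the columns

# Many numpy functions accept DataFrames directly
col_sums = np.sum(df, axis=0)
print(col_sums)

# scikit-learn's shared interface: fit on training data, then transform
scaler = StandardScaler().fit(df)
scaled = scaler.transform(df)
print(scaled.mean(axis=0))  # each column is centered at (approximately) zero
```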

# Library imports

In [1]:
import numpy as np
import pandas as pd
import sklearn as skl
import matplotlib as mpl
import seaborn as sns

# Datasets

We'll play around with a "toy" dataset included with Seaborn:
  - A list of 891 Titanic passengers and various compiled pieces of information about them

We'll also look at two datasets from [fivethirtyeight](https://fivethirtyeight.com/):
  - Guests that appeared on Jon Stewart's 'The Daily Show' (inspired by [this article](https://fivethirtyeight.com/features/every-guest-jon-stewart-ever-had-on-the-daily-show/))
  - Super Bowl commercials (inspired by [this article](https://projects.fivethirtyeight.com/super-bowl-ads/))

We can use the pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function to load in data stored in a CSV file.  Analogous pandas functions can read in data stored in a wide variety of formats including Excel, JSON, HDF, SAS, SPSS, SQL, BigQuery, STATA, [and more](https://pandas.pydata.org/pandas-docs/stable/reference/io.html). Most of these functions support reading both from locally stored files *or* directly from a remote URL.
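
As a quick illustration of `read_csv` that doesn't depend on a network connection, the sketch below reads from an in-memory text buffer.  The tiny CSV contents and the variable names (`csv_text`, `toy`) are made up for this example; swapping the buffer for a local file path or a URL string works the same way.

```python
import io
import pandas as pd

# A hypothetical three-column CSV file, held in memory as a string
csv_text = """name,year,score
Ada,1999,3.5
Grace,2001,4.0
"""

# read_csv accepts a local path, a URL, or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy)
print(toy.dtypes)  # column datatypes are inferred from the file contents
```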

In [2]:
titanic = sns.load_dataset('titanic')
daily_show_guests = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/daily-show-guests/daily_show_guests.csv', header=0)
superbowl_ads = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/superbowl-ads/main/superbowl-ads.csv', header=0)

It's always good to check that the dataset was loaded in correctly; I like to use the `head` function (which prints out the first 5 rows of the DataFrame by default; you can customize how many lines are printed by passing in any non-negative integer).  The `tail` function behaves similarly, but it prints out the *last* rows of the table.

In [3]:
titanic.head()

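| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |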


In [4]:
daily_show_guests.head(10)

| | YEAR | GoogleKnowlege_Occupation | Show | Group | Raw_Guest_List |
|---|---|---|---|---|---|
| 0 | 1999 | actor | 1/11/99 | Acting | Michael J. Fox |
| 1 | 1999 | Comedian | 1/12/99 | Comedy | Sandra Bernhard |
| 2 | 1999 | television actress | 1/13/99 | Acting | Tracey Ullman |
| 3 | 1999 | film actress | 1/14/99 | Acting | Gillian Anderson |
| 4 | 1999 | actor | 1/18/99 | Acting | David Alan Grier |
| 5 | 1999 | actor | 1/19/99 | Acting | William Baldwin |
| 6 | 1999 | Singer-lyricist | 1/20/99 | Musician | Michael Stipe |
| 7 | 1999 | model | 1/21/99 | Media | Carmen Electra |
| 8 | 1999 | actor | 1/25/99 | Acting | Matthew Lillard |
| 9 | 1999 | stand-up comedian | 1/26/99 | Comedy | David Cross |


In [5]:
superbowl_ads.head(10)

| | year | brand | superbowl_ads_dot_com_url | youtube_url | funny | show_product_quickly | patriotic | celebrity | danger | animals | use_sex |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018 | Toyota | https://superbowl-ads.com/good-odds-toyota/ | https://www.youtube.com/watch?v=zeBZvwYQ-hA | False | False | False | False | False | False | False |
| 1 | 2020 | Bud Light | https://superbowl-ads.com/2020-bud-light-seltz... | https://www.youtube.com/watch?v=nbbp0VW7z8w | True | True | False | True | True | False | False |
| 2 | 2006 | Bud Light | https://superbowl-ads.com/2006-bud-light-bear-... | https://www.youtube.com/watch?v=yk0MQD5YgV8 | True | False | False | False | True | True | False |
| 3 | 2018 | Hynudai | https://superbowl-ads.com/hope-detector-nfl-su... | https://www.youtube.com/watch?v=lNPccrGk77A | False | True | False | False | False | False | False |
| 4 | 2003 | Bud Light | https://superbowl-ads.com/2003-bud-light-hermi... | https://www.youtube.com/watch?v=ovQYgnXHooY | True | True | False | False | True | True | True |
| 5 | 2020 | Toyota | https://superbowl-ads.com/2020-toyota-go-place... | https://www.youtube.com/watch?v=f34Ji70u3nk | True | True | False | True | True | True | False |
| 6 | 2020 | Coca-Cola | https://superbowl-ads.com/2020-coca-cola-energ... | https://www.youtube.com/watch?v=-gAZRN3SCBw | True | False | False | True | False | True | False |
| 7 | 2020 | Kia | https://superbowl-ads.com/2020-kia-tough-never... | https://www.youtube.com/watch?v=lMs79UXam9A | False | False | False | True | False | False | False |
| 8 | 2020 | Hynudai | https://superbowl-ads.com/2020-hyundai-smaht-p... | https://www.youtube.com/watch?v=WBvkmWDjsYc | True | True | False | True | False | True | False |
| 9 | 2020 | Budweiser | https://superbowl-ads.com/2020-budweiser-typic... | https://www.youtube.com/watch?v=J0xugdotpp8 | False | True | True | True | True | False | False |


# Your mission, should you choose to accept it...

![mission impossible](https://66.media.tumblr.com/55af7b8e38e169303773d92d4c0e74a0/195050b77510a10c-f0/s640x960/679b8fd686c59e7fdd8d66c5e63965edeff54c9f.gif)

Individually and/or in breakout groups, you'll use the functions and hints below to carry out a series of data science "missions" on the above datasets.  You'll have just a few minutes to tackle each mission.

## Mission *goals*

The overarching goals are to:
- Increase familiarity with some advanced features of Pandas and Scikit-learn
- Learn how to quickly read an API
- Learn how to Google example code snippets and adapt them to your needs
- Communicate your thinking with others

## Mission *parameters*

I suggest that you attack each problem using the following heuristic:

1. Make sure you understand the problem:
  - What does the *input* or *data* look like?
  - What does the *output* or *solution* look like?
2. Understand which function(s) you're going to use:
  - What libraries will you need?
  - What's the syntax for using the function(s)?  (Hint: look up the library's API and then search for those functions!)
  - What format does your dataset need to be in?  How can you wrangle it into that format?
3. Create (or find!) a simple example of how to use the function(s)
  - Good places to start your search: Google, Stack Overflow, online tutorials or demos, GitHub repositories for relevant projects
  - Don't worry (initially) about starting with the dataset you'll be applying your analysis to-- it may be easier to create a fake dataset (or download another existing dataset) that is already in the proper format.  You can use this as your "dev" dataset.
4. Wrangle your dataset into the format you'll need it in.  This may involve:
  - Removing rows/columns  
  - Modifying existing values
  - Filling in missing values
  - Adding rows or columns, potentially based on values in other rows/columns
  - Changing data types (e.g., converting strings to integers, etc.)
    - It's worth spot-checking your dataset to make sure you *know* the datatype for each entry
  - Make sure that values are consistent-- e.g., standardize capitalization, formats of dates, spelling, etc.  The `unique` function (in numpy or pandas) can be really useful here.
5. Apply your function to your dataset:
  - Start by copying the syntax you used in your "dev" dataset
  - Then adjust or adapt any arguments or syntax to reflect your "target" ("real") dataset of interest
6. Examine the result:
  - Is the result the expected size?
  - Do the rows and columns look right?
  - Spot check a few entries (use `head` and `tail`, and then also select out a few rows/columns at random to check them).  More data implies that you should do more spot checks.  And if you have a small enough dataset (or if the dataset is important enough) you may want to manually check over the entire result.
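
Here is one way steps 3 through 6 of the heuristic might look in practice.  The "dev" dataset below is entirely made up (the column names `job`, `year`, and `rating` are hypothetical), and filling missing values with the column mean is just one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# A small fake "dev" dataset with the kinds of problems real data often has
dev = pd.DataFrame({
    'job': ['actor', 'Actor', ' comedian', 'Comedian', 'actor'],
    'year': ['1999', '1999', '2000', '2000', '2001'],  # numbers stored as strings
    'rating': [4.0, np.nan, 3.0, 5.0, np.nan],          # some missing values
})

# unique() reveals inconsistent capitalization and stray whitespace
print(dev['job'].unique())

clean = dev.copy()
clean['job'] = clean['job'].str.strip().str.lower()   # standardize labels
clean['year'] = clean['year'].astype(int)             # change the datatype
clean['rating'] = clean['rating'].fillna(clean['rating'].mean())  # fill gaps

# Examine the result: labels, datatypes, and size
print(clean['job'].unique())
print(clean.dtypes)
print(clean.shape)
```

Once the syntax works on the dev dataset, the same calls can be adapted to the column names of the target dataset.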

## Mission *completion*

When you've achieved your mission objectives you should clean up your code so that you can prepare it for sharing with the class.  If you're working in a group, pick one person from your group to share their screen, and (roughly) plan out who will talk.  Aim for a quick debriefing-- try to take no more than 5 minutes (and ideally closer to 2ish minutes) to explain (a) your problem and (b) your solution.

# Mission list: Pandas

## Titanic Dataset
1. Use `groupby` (and, if needed, `apply`) to create a DataFrame of the *average* (hint: `groupby.mean`) probability of survival, fare, age, proportion of males, and probability of traveling alone by:
  - Deck (`deck`)
  - Class (`pclass`)
2. Use `aggregate` to create a dataframe of the *minimum*, *median*, and *maximum* values for fare and age
3. Create a series of bar plots showing survival probability by fare, age, deck, and class.  Split each bar based on passenger gender.
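
As a warm-up for these missions, the sketch below applies `groupby` and `aggregate` to a made-up dev dataset rather than to the Titanic data itself.  The column names (`team`, `won`, `points`, `age`) are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical dev dataset: one row per player
dev = pd.DataFrame({
    'team': ['a', 'a', 'b', 'b', 'b'],
    'won': [1, 0, 1, 1, 0],
    'points': [10.0, 20.0, 30.0, 40.0, 50.0],
    'age': [20, 30, 40, 50, 60],
})

# Average of each selected column within each group;
# the mean of a 0/1 column is a proportion
group_means = dev.groupby('team')[['won', 'points']].mean()
print(group_means)

# Several summary statistics at once, one row per statistic
summary = dev[['points', 'age']].aggregate(['min', 'median', 'max'])
print(summary)
```

For the bar plots, seaborn's `barplot` function takes `x`, `y`, and `hue` arguments; passing a column name as `hue` splits each bar by that column.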

## Daily Show Dataset
4. Use `resample`, `apply`, and `rolling` to create a timeline of the proportion of guests from each category of profession (e.g., actor, comedian, model, politician, etc.) within a rolling 1-week window.  You may also find the `to_datetime` function useful.
5. Same as the fourth mission, but use `expanding` instead of `rolling` to create a timeline of the *total* proportion of guests from each category of profession up to each successive date in the show.
6. Create a bar plot showing the proportion of guests from each group (`Acting`, `Comedy`, `Musician`, etc.) by year.  Each year's group of bars should sum to 1.
7. Create a bar plot showing the number of guests from each group (`Acting`, `Comedy`, `Musician`, etc.) by year.  Each year's group of bars should sum to the total number of guests from that year.
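
The time-series functions named in these missions can be tried out first on a tiny made-up guest list.  Everything in the sketch below (the dates, the group labels, and the approach of building one indicator column per group with `get_dummies`) is an assumption for illustration, not the only way to attack the missions.

```python
import pandas as pd

# Hypothetical dev dataset: one guest per row, dates stored as strings
dev = pd.DataFrame({
    'date': ['1/4/99', '1/5/99', '1/6/99', '1/11/99', '1/12/99', '1/18/99'],
    'group': ['Acting', 'Comedy', 'Acting', 'Acting', 'Music', 'Comedy'],
})
dev['date'] = pd.to_datetime(dev['date'], format='%m/%d/%y')

# One 0/1 indicator column per group, indexed by date
indicators = pd.get_dummies(dev['group']).astype(float)
indicators.index = dev['date']

# Proportion of guests from each group within each calendar week
weekly = indicators.resample('W').mean()

# Proportion within a rolling 7-day window ending at each show date
rolling_week = indicators.rolling('7D').mean()

# Proportion over all shows up to (and including) each date
cumulative = indicators.expanding().mean()

print(weekly)
print(rolling_week)
print(cumulative)
```

Because each guest belongs to exactly one group, every row of these tables sums to 1, which is a handy spot check.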

## Superbowl Dataset
8. Use `groupby`, `apply`, and `mean` to create a DataFrame indexed by *brand* of the proportions of ads by each company with each attribute (`funny`, `show_product_quickly`, `patriotic`, `celebrity`, `danger`, `animals`, and `use_sex`)
9. Create a DataFrame indexed by each *attribute* (`funny`, `show_product_quickly`, `patriotic`, `celebrity`, `danger`, `animals`, and `use_sex`), whose columns are the unique brands (`Toyota`, `Bud Light`, `Hynudai`, etc.).  Display the proportion of all ads with the given attribute (row) associated with each brand (column).  Hint: you can use row selection functions (`query`, `loc`, `iloc`, `xs`, `take`) to select out the rows with each given attribute.  Then use `aggregate` to compute proportions.  You may also find the `transform` and/or `stack` functions useful.
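
The two missions above ask for two different normalizations of the same counts, which is easy to mix up.  The sketch below shows the distinction on a made-up table with hypothetical brands `A` and `B` and only two attributes; it uses `sum` and division rather than the suggested `aggregate`/`transform` route, so treat it as one possible approach.

```python
import pandas as pd

# Hypothetical dev dataset: one ad per row, True/False attribute columns
dev = pd.DataFrame({
    'brand': ['A', 'A', 'B', 'B', 'B'],
    'funny': [True, False, True, True, False],
    'animals': [True, True, False, True, False],
})
attributes = ['funny', 'animals']

# Proportion of each brand's ads that have each attribute
# (rows are brands; the mean of a True/False column is a proportion)
by_brand = dev.groupby('brand')[attributes].mean()
print(by_brand)

# Of all ads with a given attribute, the share belonging to each brand
# (rows are attributes, columns are brands, and each row sums to 1)
counts = dev.groupby('brand')[attributes].sum()
by_attribute = (counts / counts.sum()).T
print(by_attribute)
```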

# Mission list: Scikit-learn

## Titanic Dataset
1. Train a `LogisticRegression` classifier to predict probability of survival based on the following attributes: `pclass`, `sex`, `age`, `fare`, `class`, `deck`, and `alone`.  Note: you'll need to first convert the categorical variables (`sex`, `class`, and `deck`) into numerical variables, and decide how to handle the missing values in `age` and `deck`.  To obtain an unbiased estimate of classifier accuracy, randomly divide the data into a training and test set (use 75% of the data to train the model and evaluate performance on the remaining 25%).  Repeat this procedure 100 times (with new random assignments of training and test data each time) to obtain a distribution of classification accuracies (on the test data). The `train_test_split` function may be useful.
2. Repeat the previous mission, but using the `CategoricalNB`, `DecisionTreeClassifier`, and `SVC` classifiers.  Create a bar graph showing average classification accuracy (with 95% confidence interval error bars) for each type of classifier (also include `LogisticRegression`).
3. Use `KMeans` clustering to cluster passengers into *k* = 3 groups according to the following features: `pclass`, `sex`, `age`, `fare`, `class`, `deck`, and `alone`:
  - Create a separate DataFrame for each group
  - Plot the probability of survival for each group
  - Explore how the group assignments change with other clustering algorithms (e.g., `DBSCAN`, `Birch`)
4. Create a feature vector for each passenger using the following features: `pclass`, `sex`, `age`, `fare`, `class`, `deck`, and `alone`.  Use `PCA` to project each passenger's feature vector onto two dimensions.  Create a scatterplot (component 1 vs. component 2) where each passenger's marker is assigned as follows:
  - *shape*: a circle indicates that the passenger survived; an x indicates that the passenger died
  - *size*: scale the marker sizes to indicate the fare paid by the passenger
  - *color*: yellow = first class; orange = second class; red = third class
5. Explore how the above scatterplot changes using the following dimensionality reduction algorithms: `DictionaryLearning`, `FastICA`, `NMF`.
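
Before wrangling the Titanic data, it can help to run the core scikit-learn calls from these missions on a synthetic dev dataset.  In the sketch below, the dataset (generated with `make_blobs`), the cluster centers, the binary label, the 20 repetitions, and the normal-approximation confidence interval are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical dev dataset: 150 samples, 4 features, 3 separated groups
centers = [[0, 0, 0, 0], [6, 6, 0, 0], [-6, 6, 0, 0]]
X, groups = make_blobs(n_samples=150, centers=centers,
                       cluster_std=1.0, random_state=0)
y = (groups == 0).astype(int)  # made-up binary label to predict

# Classification: accuracy on held-out data over repeated 75%/25% splits
accuracies = []
for seed in range(20):  # the mission asks for 100 repetitions
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))
accuracies = np.array(accuracies)

# Half-width of a normal-approximation 95% confidence interval for the mean
ci95 = 1.96 * accuracies.std(ddof=1) / np.sqrt(len(accuracies))
print(accuracies.mean(), ci95)

# Clustering: assign each sample to one of k = 3 groups
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
print(np.bincount(labels))  # number of samples in each cluster

# Dimensionality reduction: standardize, then project onto 2 components
projected = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(X))
print(projected.shape)
```

Swapping in another classifier, clustering algorithm, or decomposition usually only requires changing the class name, since they share the same `fit`-based interface.  When plotting, note that matplotlib's `scatter` draws a single marker shape per call, so survivors and non-survivors would need separate calls.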

## Superbowl Dataset
6. Train `CategoricalNB` and `SVC` classifiers to predict `brand` from the following attributes: `funny`, `show_product_quickly`, `patriotic`, `celebrity`, `danger`, `animals`, and `use_sex`.  Create a bar graph showing the average classification accuracy (on held-out test data) for each type of classifier, with 95% confidence interval error bars.
7. Create a `dendrogram` (hint: see `scipy.cluster.hierarchy`) showing which brands create similar ads (based on the attributes `funny`, `show_product_quickly`, `patriotic`, `celebrity`, `danger`, `animals`, and `use_sex`).
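
A dendrogram is built from a hierarchical clustering of the rows of a table.  The sketch below uses scipy's `linkage` and `dendrogram` functions on a made-up table of per-brand attribute proportions; the brand names, the values, and the choice of average linkage are assumptions for illustration.

```python
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical per-brand attribute proportions (one row per brand)
profiles = pd.DataFrame(
    {'funny': [0.9, 0.8, 0.1, 0.2], 'animals': [0.7, 0.6, 0.1, 0.0]},
    index=['brand_a', 'brand_b', 'brand_c', 'brand_d'])

# Hierarchically cluster the rows (Euclidean distance, average linkage)
Z = linkage(profiles.values, method='average')

# no_plot=True returns the tree layout without drawing it;
# remove it (with matplotlib available) to draw the figure
tree = dendrogram(Z, labels=profiles.index.tolist(), no_plot=True)
print(tree['ivl'])  # leaf labels in the order they appear along the axis
```

Brands with similar profiles end up next to each other in the leaf ordering and are joined low in the tree.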

# Going Rogue...
![going rogue](https://i.gifer.com/origin/4b/4bd1ef2830b4059cc55245e786a01cc9.gif)

As your final mission, discover or explore something interesting about any of the datasets we've explored, using any of the features of `pandas` and/or `scikit-learn`.

# Concluding thoughts

In attempting these problems, I'm hoping you've taken away a few key things:

1. These libraries are feature-rich and useful for a wide range of applications.  I encourage you to think about how you might use these libraries in your research.
2. Each library has its own "feel" or "style" to it.  The best way to build up intuitions about different libraries is to actually use them, ideally on real datasets, by turning questions about the data into analyses that you can implement using the given library.
3. You'll get used to working with a subset of the functions from each library that you use most.  But for most other functions (and often even for the ones you know), you'll likely find yourself looking up syntax or examples very frequently.  Learning how to find what you want is way more useful and important than attempting to memorize syntax.

We could easily take an entire term digging into any of these libraries.  This course can introduce you to some basic functionality and intuitions, but the "next step" in your learning journey is to learn how to "train yourself" to use these libraries (and others, including libraries that haven't been written yet!) by relying on your intuitions, prior knowledge about coding and other libraries, and any documentation or examples you can track down.