# Principal Component Analysis

In this assignment, you'll refresh your skills working with `pandas` and `numpy`, while implementing a common data analysis technique called Principal Component Analysis (PCA). PCA is a *dimensionality reduction technique*, which can often be used to obtain relatively simple visualizations of complex data sets. [Here](https://en.wikipedia.org/wiki/Principal_component_analysis) is an overview of PCA. [Here](https://www.stat.cmu.edu/~cshalizi/uADA/16/lectures/17.pdf) you can find an *optional* discussion of PCA with a heavy mathematical and statistical flavor.

Your task is to create a 2d visualization of the *Palmer Penguins* data set. The Palmer Penguins data set was collected by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/), a member of the [Long Term Ecological Research Network](https://lternet.edu/). It contains measurements on three penguin species: Chinstrap, Gentoo, and Adelie.

<figure class="image" style="width:30%">
  <img src="https://allisonhorst.github.io/palmerpenguins/man/figures/lter_penguins.png" alt="Three stylized penguins, one each of the species Adelie, Gentoo, and Chinstrap, with labels above their heads and patches of color behind them.">
  <figcaption><i>Illustrations of the penguin species in the Palmer Penguins data set, by Allison Horst.</i></figcaption>
</figure>

If you took PIC16A with me, you've already seen this data set. 

<figure class="image" style="width:30%">
  <img src="https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/_images/not-done.jpg" alt="">
  <figcaption><i></i></figcaption>
</figure>

If not, that's ok! Ask a classmate about their beautiful adventure among the penguins. 

The data set is hosted at: 

```python
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/palmer_penguins.csv"
```

You can read it into Python as a `pandas` data frame via `pd.read_csv(url)`. 

Your ultimate product should be an attractive, polished visualization of the penguin species, plotted against the first and second principal components. Here is an example: 

<figure class="image" style="width:50%">
  <img src="https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/HW/penguin_pca.png" alt="">
  <figcaption><i></i></figcaption>
</figure>

Your plot may look somewhat different from mine depending on your implementation. 


## Side Note: Contextualizing Your Data

In data science, it is important to develop a thorough understanding of the data that you are attempting to model. I recommend consulting [these](https://www.youtube.com/watch?v=ZbASA6fZaRI) [helpful](https://www.youtube.com/watch?v=M5UlTRrVaTk) [videos](https://www.youtube.com/watch?v=RoTVc32TLx8) to build your knowledge of penguin behaviors outside their natural habitats.  

## Specs and Hints

- You must implement PCA by hand, using `numpy` linear algebra methods. You'll need to multiply matrices (2d `numpy` arrays) and extract eigenvalues and eigenvectors. Yes, there is a class `sklearn.decomposition.PCA`. No, you should not use it for this homework. 
    - Corollary: though I have suggested reading the data as a `pandas` data frame, you will eventually need to convert it into a `numpy` array. 
- You will likely want to standardize your data columns. To *standardize* a column, compute the mean and standard deviation of that column. Subtract that mean from each entry of the column, and divide the result by the standard deviation. One way to do this is to figure out how to standardize a single data column (or 1d `numpy` array) and then apply that to each column of the data. An alternative approach is to use `numpy` array broadcasting. 
- You should use at least five columns of the data, other than `Species`. If you choose to use any categorical columns (such as `Island` or `Sex`), you will first need to encode them as numeric columns. A [label encoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) will be helpful. 
- It is possible to create the specified plot using `matplotlib`. However, you may find it easier to use `seaborn`. 
- You will need to handle `NA` values in the data in some way, and explain your choice. 
- With the possible exception of looping over species to create the plot itself, there is no need to use `for`-loops in this problem. 
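
Here is a minimal sketch of how several of these hints might fit together, run on a tiny made-up data frame rather than the penguins. Everything specific in it is an assumption for illustration: the column names and values are hypothetical, dropping incomplete rows is only one of several reasonable ways to handle `NA` values, and `np.unique` is used as a stand-in for the label encoder linked above. It stops after the eigendecomposition, so the projection and the plot are still yours to work out.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: made-up column names and values, for illustration only.
df = pd.DataFrame({
    "length": [39.1, 46.5, np.nan, 50.0, 38.2, 47.3],
    "depth":  [18.7, 17.9, 18.0, 15.2, 18.1, 15.3],
    "mass":   [3750.0, 3500.0, 3250.0, 5700.0, 3950.0, 5100.0],
    "sex":    ["MALE", "FEMALE", "FEMALE", "MALE", None, "FEMALE"],
})

# One possible NA strategy (an assumption, not a requirement): drop incomplete rows.
clean = df.dropna().copy()

# Encode the categorical column as integers. np.unique returns the sorted
# labels and, with return_inverse=True, the integer code of each entry.
labels, codes = np.unique(clean["sex"].to_numpy(dtype=str), return_inverse=True)
clean["sex"] = codes

# Convert to a 2d float array: rows are observations, columns are features.
X = clean.to_numpy(dtype=float)

# Standardize every column at once via broadcasting: the per-column means and
# standard deviations (1d arrays) are stretched across the rows of X.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Matrix multiplication with @ gives the covariance matrix of the standardized
# columns. It is symmetric, so np.linalg.eigh applies; eigh returns the
# eigenvalues in ascending order, with matching eigenvectors as columns.
n = Z.shape[0]
C = Z.T @ Z / n
evals, evecs = np.linalg.eigh(C)
```

Note that `np.std` divides by the number of rows by default, while the `pandas` method `.std()` divides by one less than that. Either convention works for standardizing, as long as you use it consistently.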

## Writing

You should write this assignment in the format of a *tutorial*. That is, you should write with the aim of teaching someone how to perform and visualize PCA in Python. Your code should contain comments and docstrings, while your surrounding text should explain the overall purpose and structure of your approach. For concreteness, you can imagine that you are explaining PCA to "you from two weeks ago." 
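
To give a sense of what commented and documented code might look like, here is a small sketch of a helper with a docstring. The function name `standardize` and the docstring layout are my own illustrative choices (the layout follows one common convention); any clear and consistent style is fine.

```python
import numpy as np

def standardize(x):
    """
    Standardize a 1d array to have mean 0 and standard deviation 1.

    Parameters
    ----------
    x : np.ndarray
        A 1d numeric array with no missing values and nonzero spread.

    Returns
    -------
    np.ndarray
        An array of the same shape as x: (x - mean) / standard deviation.
    """
    # Center at zero, then rescale so the result has unit standard deviation.
    return (x - x.mean()) / x.std()

# Example use on a small made-up column of measurements.
z = standardize(np.array([2.0, 4.0, 6.0, 8.0]))
```

The docstring says what goes in and what comes out, while the inline comment explains why the line does what it does. Your surrounding text can then focus on the bigger picture.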