# Exercise 1  - Loading and plotting the UCI wine dataset

## Goals

* Introduction to Google Colab
* Loading data into Pandas
* Indexing Pandas DataFrames
* Creating scatter plots using matplotlib

Before we get started, you'll need to install `scprep` by running the following command (press **Shift-Enter** to run the cell). After installing, restart the notebook by clicking `Restart Runtime`, which will appear below the install message.

In [None]:
!pip install --user scprep

## 1. Loading [the UCI wine dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html)

### Imports
(Almost) everything in Python is imported. These lines load modules from four libraries:
* `numpy` - *the* library for numerical computing in Python
* `pandas` - a library for organizing and manipulating data
* `matplotlib` - a library for plotting
* `sklearn` - short for scikit-learn, a machine learning toolkit in Python

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn import datasets

### Loading the data from sklearn into a pandas DataFrame

The core of pandas is the [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). The `DataFrame` is a two-dimensional container for data that is organized into rows and columns. Generally speaking, rows in a DataFrame should correspond to observations, and columns should correspond to the features (or variables) of the data.
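
To make the rows-as-observations, columns-as-features layout concrete, here is a minimal sketch with a tiny made-up table. The values and the names `toy`, `WineA`, `WineB`, and `WineC` are invented for illustration and are not part of the wine dataset.

```python
import pandas as pd

# Toy data (made up for illustration): each row is one observation,
# each column is one feature measured on that observation.
toy = pd.DataFrame(
    [[13.2, 1.8], [12.4, 2.1], [14.1, 1.6]],
    columns=["alcohol", "malic_acid"],
    index=["WineA", "WineB", "WineC"],
)

print(toy.shape)          # (number of observations, number of features)
print(list(toy.columns))  # feature names
print(list(toy.index))    # observation names
```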

#### Load data

In [None]:
wine = datasets.load_wine()

#### Extract information to put in DataFrame
When you load data from sklearn, it is packaged into a Bunch (like a dict). We want the data in a pandas DataFrame so we can work with it easily.

In [None]:
# Access the numerical data from the wine Bunch
data = wine['data']

data
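
If you are curious what else is inside the Bunch, the sketch below lists its keys and shows that dict-style and attribute-style access point at the same array. The variable name `wine_bunch` is only used here so that it does not overwrite anything in the notebook.

```python
from sklearn import datasets

wine_bunch = datasets.load_wine()

# A Bunch can be inspected like a dict
print(sorted(wine_bunch.keys()))

# Dict-style and attribute-style access return the same object
print(wine_bunch["data"] is wine_bunch.data)

# One row per wine, one column per feature
print(wine_bunch["data"].shape)
```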

In [None]:
# Load data about the rows and columns
feature_names = wine['feature_names']

# Load cultivar information about each wine
cultivars = np.array(['Cultivar{}'.format(cl) for cl in wine['target']])

# Create nice names for each row
wine_names = np.array(['Wine{}'.format(i) for i in range(data.shape[0])])

#### Create DataFrame

In [None]:
# Gather all of this information into a DataFrame
data = pd.DataFrame(data, columns=feature_names, index=wine_names)

# Print the first 5 rows of the data (equivalent to data[:5])
data.head()

## 2. Looking at the data in the DataFrame

In [None]:
# Access the 'total_phenols' for all wines
data['total_phenols']

In [None]:
# Access the first 6 wines
data.iloc[:6]

In [None]:
# Print the 'malic_acid', 'magnesium', and 'color_intensity' values for the first 3 wines
# First, select columns, then select rows
data[['malic_acid', 'magnesium', 'color_intensity']][:3]
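
The indexing styles used above can be compared side by side on a small made-up table. This is a sketch for illustration only: the frame `toy` and its labels are invented, and `.loc` (label-based selection) is included as an extra that the cells above do not use.

```python
import pandas as pd

toy = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "c": [100, 200, 300, 400]},
    index=["r0", "r1", "r2", "r3"],
)

# One column by name -> a Series
col_a = toy["a"]

# Several columns by name -> a DataFrame (note the double brackets)
cols_ab = toy[["a", "b"]]

# Rows by integer position; the end of the slice is excluded
first_two = toy.iloc[:2]

# Rows and columns by label; with .loc the end of the slice is included
by_label = toy.loc["r1":"r2", ["a", "c"]]

print(first_two)
print(by_label)
```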

### Exercise 1 - Print the 'proanthocyanins' and 'flavanoids' for wines 100-111

In [None]:
# ===============
# select the appropriate columns with the data['column'] syntax and the appropriate wines with [] indexing

# ===============

#### _Breakpoint_  - once you get here, please help those around you!

## 3. Introduction to plotting data

We're not going to go in depth on plotting, because that could take a full day to go through. There are plenty of online learning resources for getting the hang of matplotlib, though.

We recommend:
* [The official matplotlib pyplot tutorial](https://matplotlib.org/users/pyplot_tutorial.html)
* [The free Cognitive Class Data Visualization Course](https://cognitiveclass.ai/courses/data-visualization-with-python)
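
As a quick orientation before the exercise, here is a minimal sketch of the usual matplotlib pattern: create a figure and an axis, draw on the axis, then label it. The points are made up and have nothing to do with the wine data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up points, used only to show the plotting calls
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.5, 3.5, 3.0])

# Create the Figure and Axis objects
fig, ax = plt.subplots(1, figsize=(4.5, 4))

# Draw the points and label the plot
ax.scatter(x, y)
ax.set_xlabel("x value")
ax.set_ylabel("y value")
ax.set_title("A minimal scatter plot")

# Adjust the spacing so the labels are not cut off
fig.tight_layout()
```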

### Exercise 2 - Plot the alcohol and hue of the first 10 wines in the dataset

In [None]:
# ===============
# Get the first 10 rows and the 'hue' and 'alcohol' columns of the data
plotting_data = 
# ===============

# Create the Figure and Axis objects
fig, ax = plt.subplots(1, figsize=(4.5,4))

# Call the scatter() method of the axis object
ax.scatter(plotting_data.iloc[:,0], plotting_data.iloc[:,1])

# Adjust the spacing so the labels fit inside the figure.
# If you were running this on your computer, you could also save the image with fig.savefig()
fig.tight_layout()

## 4. Introducing `scprep`

`scprep` is a toolkit for analysis of single cell data. It's open source and publicly available on GitHub at https://github.com/KrishnaswamyLab/scprep. You can install this using pip with `pip install scprep`.

Like most packages, `scprep` is divided into modules. You can find a full list of the available modules and documentation in the `scprep` documentation at https://scprep.readthedocs.io/.

### Plot the alcohol and hue of all wines in the dataset *using scprep*

It will be helpful for you to look at [the documentation for `scprep.plot.scatter2d()`](https://scprep.readthedocs.io/en/stable/reference.html#scprep.plot.scatter2d).

In [None]:
import scprep

In [None]:
# =======
# replace '...' with the correct code
scprep.plot.scatter2d( ... , 
                     c=cultivars, # We created this when importing the wine data
                     title='Hue and % Alcohol of all wines', # Goes at the top
                     xlabel='Alcohol', ylabel='Hue', # Labels the x and y axis
                     # This legend_anchor position puts the legend outside the plot
                     legend_title='Variety', legend_anchor=(1,1), 
                     figsize=(5.5,4))
# =======

#### _Breakpoint_  - once you get here, please help those around you!

## Identify features that differentiate two cultivars

In this exercise, each pair at your table will pick two cultivars and use histograms to identify features that differentiate them.

It will be helpful for you to look at [the documentation for `scprep.plot.histogram()`](https://scprep.readthedocs.io/en/stable/reference.html#scprep.plot.histogram).

As a pair:
1. Pick two of the three cultivars to study
2. Go through the features one by one and plot their histograms (remember you can duplicate code blocks!)
3. Identify the feature that best differentiates your two cultivars (however you define 'best')
4. Once you are done, share your findings with your table

In [None]:
# Pick a feature in the data
feature = data['alcohol']

# Type in your desired cultivars here
a_group = 'Cultivar0'
b_group = 'Cultivar1'

scprep.plot.histogram([feature[cultivars == a_group], feature[cultivars == b_group]],
                      title=feature.name, bins=20,
                      xlabel='Feature value',ylabel='Frequency')
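
The expression `feature[cultivars == a_group]` relies on boolean masking: comparing an array to a value gives an array of `True`/`False`, and indexing with that array keeps only the `True` rows. A minimal sketch with made-up values (the names `values`, `labels`, `mask`, and `group0` are invented for this example):

```python
import numpy as np
import pandas as pd

# Made-up values and group labels, for illustration only
values = pd.Series([12.0, 13.5, 11.8, 14.2], name="alcohol")
labels = np.array(["Cultivar0", "Cultivar1", "Cultivar0", "Cultivar1"])

# Element-wise comparison gives one True/False per entry
mask = labels == "Cultivar0"

# Indexing with the mask keeps only the entries where mask is True
group0 = values[mask]

print(mask)
print(group0.tolist())
```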

#### _Breakpoint_  - once you get here, please help those around you!

## 5. Calculating pairwise distances

To calculate pairwise distances (i.e. the distance between every pair of points), we will use the [`pdist`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html) function from `scipy`. Next, we will visualize the distances using a heatmap. Right now, there's no function for plotting heatmaps in `scprep`, because another package, seaborn, already has comprehensive support for plotting heatmaps.

Package descriptions:
* [`scipy`](https://docs.scipy.org/doc/scipy/reference/index.html) - part of the same ecosystem as `numpy`. Contains many functions for scientific computing in Python including statistical computations, linear algebra, etc.
* [`seaborn`](https://seaborn.pydata.org/) - an independent plotting package for Python that wraps around `matplotlib` to create beautiful, complex plots.



### Imports

In [None]:
# we only need a couple of functions from scipy.spatial.distance, so we'll directly import them
from scipy.spatial.distance import pdist, squareform

# it is common to import seaborn as sns
import seaborn as sns

#### Calculating pairwise distances

In [None]:
pairwise_distances = squareform(pdist(data))
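
To see what `pdist` and `squareform` each return, it helps to try them on three made-up points where the distances are easy to check by hand. By default `pdist` uses the Euclidean distance.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three made-up points on a line through the origin
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Condensed form: one distance per pair, in the order (0,1), (0,2), (1,2)
condensed = pdist(points)

# Square form: a symmetric matrix with zeros on the diagonal
square = squareform(condensed)

print(condensed)
print(square)
```

Entry `square[i, j]` is the distance between point `i` and point `j`, which is the layout the heatmap needs.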

#### Plotting the heatmap

Here we're using the `clustermap` function to get the row and column colors, but turning off the clustering component with `row_cluster=False` and `col_cluster=False`.

Documentation for `clustermap` can be found here: https://seaborn.pydata.org/generated/seaborn.clustermap.html

In [None]:
cg = sns.clustermap(pairwise_distances, 
               row_colors=plt.cm.tab10(wine['target']), 
               col_colors=plt.cm.tab10(wine['target']), 
               row_cluster=False, col_cluster=False,
               xticklabels=[], yticklabels=[])

# Each row and column of the distance matrix is one wine
cg.ax_heatmap.set_xlabel(r'Wine$_{i}$', fontsize=22)
cg.ax_heatmap.set_ylabel(r'Wine$_{j}$', fontsize=22, rotation=90)
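
The `row_colors` and `col_colors` arguments above are built by calling a colormap on the integer cultivar labels. A small sketch of what that call produces, using a made-up label array as a stand-in for `wine['target']`:

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for wine['target']: one integer label per observation
labels = np.array([0, 0, 1, 2])

# Calling the colormap on integers looks up one RGBA color per label
colors = plt.cm.tab10(labels)

print(colors.shape)  # one row per label, four columns (red, green, blue, alpha)
print(colors)
```

Observations with the same label get the same color, which is why the color bars show the cultivars as solid blocks.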

### Discussion

Try commenting out (adding a "#" at the beginning of the line) the line that sets `row_cluster=False` and `col_cluster=False`. This will make `clustermap` automatically cluster the data. 

Discuss the following questions:
1. What do you notice about the groupings? 
2. Do you think the wine cultivars are "Euclidean blobs"?

#### _Breakpoint_  - once you get here, please help those around you!

### Why feature scaling is important

We'll use `sklearn`'s `StandardScaler` to z-score the features of the wine dataset. You'll see that a heatmap of the data _without_ doing this is dominated by a single high-magnitude feature, which is much less informative.

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
data_scaled = StandardScaler().fit_transform(data)
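
Z-scoring means subtracting each feature's mean and dividing by its standard deviation, so that every feature ends up with mean 0 and standard deviation 1. The sketch below checks this on made-up data (the array `X` is invented for illustration) by comparing `StandardScaler` with the same calculation written out by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: the second feature is on a much larger scale than the first
X = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]])

X_scaled = StandardScaler().fit_transform(X)

# The same result by hand: subtract each column's mean,
# then divide by each column's standard deviation
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```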

In [None]:
sns.clustermap(data_scaled, xticklabels=data.columns, row_colors=plt.cm.tab10(wine['target']))

Notice that the data clusters together quite nicely (as you can see from the colors on the left, which correspond to the cultivars). Now let's try again without the scaling.

In [None]:
sns.clustermap(data, row_colors=plt.cm.tab10(wine['target']))
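
One way to check which feature dominates the unscaled heatmap is to compare the spread of each feature. This is a sketch rather than part of the exercise; it reloads the data under the separate names `wine_bunch` and `df` so that nothing in the notebook is overwritten.

```python
import pandas as pd
from sklearn import datasets

wine_bunch = datasets.load_wine()
df = pd.DataFrame(wine_bunch["data"], columns=wine_bunch["feature_names"])

# Standard deviation of each feature, largest first
spread = df.std().sort_values(ascending=False)

print(spread.head(3))
```

The top entry should be `proline`, whose values are in the hundreds or more while most other features stay in single or double digits.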

#### Scaling before distance calculation

We can also take another look at our pairwise distances plot and see what scaling does to the output.

In [None]:
pairwise_distances_scaled = squareform(pdist(data_scaled))

cg = sns.clustermap(pairwise_distances_scaled, 
               row_colors=plt.cm.tab10(wine['target']), 
               col_colors=plt.cm.tab10(wine['target']), 
               row_cluster=False, col_cluster=False,
               xticklabels=[], yticklabels=[])

# Each row and column of the distance matrix is one wine
cg.ax_heatmap.set_xlabel(r'Wine$_{i}$', fontsize=22)
cg.ax_heatmap.set_ylabel(r'Wine$_{j}$', fontsize=22, rotation=90)

You can see that the three-cluster structure is much clearer here than when we used the raw data, where we only really saw two obvious groups. If we allow `seaborn` to cluster the rows and columns, we'll see the structure more clearly still.

### Discussion

Again, try commenting out (adding a "#" at the beginning of the line) the line that sets `row_cluster=False` and `col_cluster=False`. This will make `clustermap` automatically cluster the data.

Discuss the following questions:
1. What do you notice about the groupings? 
2. Now do you think the wine cultivars are "Euclidean blobs"?