# <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> CS109B Data Science 2: Advanced Topics in Data Science 
## Lab 1 - Introduction and Setup

**Harvard University**<br>
**Spring 2020**<br>
**Instructors:** Mark Glickman, Pavlos Protopapas, and Chris Tanner<br>
**Lab Instructors:** Chris Tanner and Eleni Kaxiras<br>
**Contributors:** Will Claybaugh and Eleni Kaxiras

---

In [None]:
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.display import HTML  # the public import path; IPython.core.display is deprecated for this use
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css").text
HTML(styles)

In [None]:
import numpy as np
#import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline 

## Learning Goals

The purpose of this lab is to get you up to speed with what you will need to run the code for CS109b.

## Managing Local Resources


### Forking the class repo

To get access to the code used in class you will need to clone the class repo: [https://github.com/Harvard-IACS/2020-CS109B](https://github.com/Harvard-IACS/2020-CS109B)

To avoid losing your changes when you update (pull) from the main repo, a good practice is to `fork` the repo into your own GitHub account and work in a clone of your fork. For more on this see Maddy Nakada's notes: [How to Fork a Repo](ForkRepo.pdf)

You will need this year's repo: `https://github.com/Harvard-IACS/2020-CS109B.git`

### Cloning the class repo and copying the contents to a different directory so you can make changes

* Open the Terminal on your computer and go to the directory where you want to clone the repo. Then run

`git clone https://github.com/Harvard-IACS/2020-CS109B.git`

* If you have already cloned the repo, go inside the '/2020-CS109B/' directory and run 

`git pull`

* If you change the notebooks and then run `git pull`, your changes can conflict with the updated versions, and the pull may fail or require a merge. To avoid this, create a `playground` folder and copy the folder containing the notebook you want to work on into it.
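
The copy step above can also be done from Python. The following is a minimal sketch, not part of the course tooling: the function name `copy_to_playground` and the example paths are hypothetical, and it assumes you want one sub-folder of the cloned repo copied into a separate working folder.

```python
import shutil
from pathlib import Path


def copy_to_playground(source_dir, playground_dir):
    """Copy one lab folder into a playground folder and return the new path."""
    source = Path(source_dir)
    destination = Path(playground_dir) / source.name
    # copytree creates missing parent folders and raises FileExistsError if the
    # destination already exists, so an existing working copy is not overwritten.
    shutil.copytree(source, destination)
    return destination


# Hypothetical paths; replace them with the real locations on your machine.
# copy_to_playground("2020-CS109B/path/to/lab01", "playground")
```

Because the copy lives outside the cloned repo's tracked files, a later `git pull` in the repo does not touch it.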

### Setting up your Local Environment (supported by cs109b)

#### Use Virtual Environments: I cannot stress this enough!

Isolating your projects inside specific environments helps you manage dependencies and therefore keep your sanity. Sometimes installations of certain libraries conflict with one another, and you can recover from such mess-ups by simply deleting an environment.

In order of isolation here is what you can do: a) set up a virtual environment, b) set up a virtual 
The two most popular tools for setting up environments are:

- `conda` (a package and environment manager)
- `pip` (a Python package manager) with `virtualenv` (a tool for creating environments)

We recommend using `conda` package installation and environments. `conda` installs packages from the Anaconda Repository and Anaconda Cloud, whereas `pip` installs packages from PyPI. Even if you are using `conda` as your primary package installer and are inside a `conda` environment, you can still use `pip install` for those rare packages that are not included in the `conda` ecosystem. 

``` 
$ cd https://github.com/Harvard-IACS/2020-CS109B/####### REMAINING PATH TBD #######
$ conda env create -f cs109b.yml
$ conda activate cs109b
```

See here for more details on how to manage [Conda Environments](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).
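
After activating the environment, it can be useful to confirm from inside Python (or a notebook cell) which interpreter is actually running. The sketch below is one possible check; the helper name `describe_environment` is hypothetical. It relies on the `CONDA_DEFAULT_ENV` variable that `conda activate` sets, which may be missing if the notebook kernel was started some other way, so the interpreter path is the more reliable signal.

```python
import os
import sys


def describe_environment():
    """Return the interpreter path, Python version, and active conda environment name."""
    return {
        "python_executable": sys.executable,
        "python_version": sys.version.split()[0],
        # conda sets CONDA_DEFAULT_ENV on activation; the value is None outside conda.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If the environment was created from `cs109b.yml` and activated as shown, the interpreter path would be expected to contain the environment name `cs109b`.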

<div class='exercise'> <b> Exercise 1:  Use the cs109b.yml file to create an environment.</b></div>
    
We have included the packages that you will need in the `cs109b.yml` file. It should be in the same directory as this notebook. 

<div class='exercise'> <b> Exercise 2:  Clone or Fork the CS109b git repository.</b></div>

## Using Cloud Resources (optional)

### Using SEAS JupyterHub (supported by cs109b)

[Instructions for Using SEAS JupyterHub](https://canvas.harvard.edu/courses/65462/pages/instructions-for-using-seas-jupyterhub)

SEAS and FAS are providing you with a platform in AWS to use for the class, accessible via the 'JupyterHub' menu link in Canvas. Between now and March 1, each student will have their own t2.medium AWS EC2 instance with 4 GB of CPU RAM and 2 vCPUs. After March 1, the instances will be upgraded to p2.xlarge AWS EC2 instances with a GPU, 61 GB of CPU RAM, 12 GB of GPU RAM, 10 GB of disk space, and 4 vCPUs.

Most of the libraries such as keras, tensorflow, pandas, etc. are pre-installed. If a library is missing you may install it via the Terminal.

**NOTE: The AWS platform is funded by SEAS and FAS for the purposes of the class. It does not run against your individual credit. Use it prudently; using it for purposes unrelated to this course is not allowed.**

**Help us keep this service: Make sure you stop your instance as soon as you do not need it.**

### Using Google Colab (on your own)

Google's Colab platform [https://colab.research.google.com/](https://colab.research.google.com/) offers a GPU environment to test your ideas. It is fast and free, with the only caveat that your files persist only for 12 hours. The solution is to keep your files in a repository and just clone it each time you use Colab.

### Using AWS in the Cloud (on your own)

For those of you who want to have your own machines in the Cloud to run whatever you want, Amazon Web Services is a (paid) solution. For more see: [https://docs.aws.amazon.com/polly/latest/dg/setting-up.html](https://docs.aws.amazon.com/polly/latest/dg/setting-up.html)

Remember, AWS is a paid service so if you let your machine run for days you will get charged!<BR>
![aws-dog](../images/aws-dog.jpeg)

*source: maybe Stanford's cs231n via Medium*

## Packages we will need for this class

- **Clustering**:
 - Sklearn - [https://scikit-learn.org/stable/](https://scikit-learn.org/stable/)
 - scipy - [https://www.scipy.org](https://www.scipy.org)
 - gap_statistic (by Miles Granger) - [https://anaconda.org/milesgranger/gap-statistic/notebook](https://anaconda.org/milesgranger/gap-statistic/notebook)

- **Smoothing**:
 - statsmodels - [https://www.statsmodels.org/](https://www.statsmodels.org/) (examples: [https://www.statsmodels.org/stable/examples/index.html#regression](https://www.statsmodels.org/stable/examples/index.html#regression))
 - scipy
 - pyGAM - [https://pygam.readthedocs.io/en/latest/](https://pygam.readthedocs.io/en/latest/)

- **Bayes**:
 - pymc3 - [https://docs.pymc.io](https://docs.pymc.io)
 
- **Neural Networks**:
 - keras - [https://www.tensorflow.org/guide/keras](https://www.tensorflow.org/guide/keras)


We will test that these packages load correctly in our environment.
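
Before running the individual checks, a quick way to see whether anything is missing is to ask Python whether each package can be located, without importing it. This is only a sketch: the helper `find_missing` is hypothetical, and the import names in `PACKAGES` are assumptions based on the list above (the import name can differ from the package name, e.g. scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Assumed import names for the packages listed above.
PACKAGES = ["sklearn", "scipy", "gap_statistic", "statsmodels", "pygam", "pymc3", "tensorflow"]


def find_missing(packages):
    """Return the import names that cannot be located in the current environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = find_missing(PACKAGES)
print("Missing packages:", missing if missing else "none")
```

A package that is found can still fail at import time (for example, because of a broken dependency), so this check complements the import tests rather than replacing them.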

In [None]:
from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()
digits.target # you should see [0, 1, 2, ..., 8, 9, 8]

In [None]:
from scipy import misc
import matplotlib.pyplot as plt

face = misc.face()
plt.imshow(face)
plt.show() # you should see a raccoon

In [None]:
import statsmodels.api as sm

import statsmodels.formula.api as smf

# Load data
dat = sm.datasets.get_rdataset("Guerry", "HistData").data
dat.head()

In [None]:
from pygam import PoissonGAM, s, te
from pygam.datasets import chicago
from mpl_toolkits.mplot3d import Axes3D

X, y = chicago(return_X_y=True)

gam = PoissonGAM(s(0, n_splines=200) + te(3, 1) + s(2)).fit(X, y)

In [None]:
XX = gam.generate_X_grid(term=1, meshgrid=True)
Z = gam.partial_dependence(term=1, X=XX, meshgrid=True)

ax = plt.axes(projection='3d')
ax.plot_surface(XX[0], XX[1], Z, cmap='viridis')

In [None]:
import pymc3 as pm
print('Running PyMC3 v{}'.format(pm.__version__)) # you should see 'Running PyMC3 v3.8'

## Plotting 

### `matplotlib` and `seaborn`

- `matplotlib` 
- [seaborn: statistical data visualization](https://seaborn.pydata.org/). `seaborn` works great with `pandas`.  It can also be customized easily.  Here is the basic `seaborn` tutorial: [Seaborn tutorial](https://seaborn.pydata.org/tutorial.html).

#### Plotting a function of 2 variables using contours

In optimization, our objective function will often be a function of two or more variables. While it's hard to visualize a function of more than 3 variables, it's very informative to plot a function of 2 variables. To do this we use contours. First we define the $x_1$ and $x_2$ variables and then construct their pairs using `meshgrid`.

In [None]:
import seaborn as sn
x1 = np.array([1,2,3,4,5,6,7,8,9])
x2 = np.array([1,2,3,4,5,6,7,8,9])

In [None]:
x = np.linspace(-0.1, 0.1, 50)
y = np.linspace(-0.1, 0.1, 50)
xx, yy = np.meshgrid(x, y)
z = np.sqrt(xx**2+yy**2)
plt.contour(x,y,z)
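
To make the "construct their pairs" step concrete, here is a small sketch of what `np.meshgrid` returns. The short arrays are illustrative values chosen so the shapes are easy to read; they are not the ones used in the cells above.

```python
import numpy as np

x1 = np.array([1, 2, 3])
x2 = np.array([10, 20])

# meshgrid returns two 2-D arrays; with the default indexing='xy' each has
# shape (len(x2), len(x1)).
xx1, xx2 = np.meshgrid(x1, x2)

# Stack the flattened grids to list every (x1, x2) pair, one pair per row.
pairs = np.column_stack([xx1.ravel(), xx2.ravel()])

# Evaluate a function on the grid; an array like this is what plt.contour takes as z.
z = np.sqrt(xx1**2 + xx2**2)

print(xx1.shape, pairs.shape, z.shape)  # prints: (2, 3) (6, 2) (2, 3)
```

Each entry `z[i, j]` is the function value at the point `(x1[j], x2[i])`, which is why the row count follows `x2` and the column count follows `x1`.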

## We will be using `tensorflow` and `keras`

**[TensorFlow](https://www.tensorflow.org)** is a framework for representing complicated ML algorithms and executing them on any platform, from a phone to a distributed system using GPUs. Developed by Google Brain, TensorFlow is used very broadly today.

**[Keras](https://keras.io/)** is a high-level API used for fast prototyping, advanced research, and production. We will use `tf.keras`, which is TensorFlow's implementation of the `keras` API.

<div class="exercise"><b>Exercise 3: Run the following cells to make sure you have the basic libraries to do deep learning</b></div>

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.regularizers import l2

tf.keras.backend.clear_session()  # For easy reset of notebook state.

print(tf.__version__)  # You should see a >2.0.0 here!
print(tf.keras.__version__)

In [None]:
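# Checking if our machine has NVIDIA GPUs. Mine does not..
# tf.config.experimental_list_devices() was removed after TensorFlow 2.0;
# list_physical_devices('GPU') (TensorFlow 2.1 and later) lists only the GPUs.
gpus = tf.config.list_physical_devices('GPU')
print(f'My computer has the following GPUs: {gpus}')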

<div class="exercise"><b>DELIVERABLES</b></div>

**Submit this notebook to Canvas with the output produced**. Describe below the environment in which you will be working, e.g. I have installed the two environments needed locally and have tested all the code in this notebook. 

---------------- your answer here


-----------------