# Lecture 1: Introduction to Datahub, Jupyter, and Terminal
## 20 Sep 2022

### Table Of Contents
* [Introduction](#section1)
* [What will we learn?](#section2)
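* [Final Project, Homework, and Terminal](#section3)
* [DataHub/Jupyter Notebook Guide](#section4)
* [(Optional) JupyterLab Guide](#section5)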


### Hosted and maintained by the [Student Association for Applied Statistics (SAAS)](https://saas.berkeley.edu).


<a id='section1'></a>
# Introduction
Hello! Welcome to Career Exploration Fall 2022!

This is an introductory notebook for practicing with Datahub and going over the semester schedule.

Datahub is a fantastic resource: it lets us use Python and common packages without installing everything ourselves and risking a broken setup.

Run the code chunk below by clicking on it and pressing `shift enter` (`shift return` on a Mac). These are common packages we will import throughout the semester.

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
%matplotlib inline

Ordinarily you would need to install Python and all of these packages yourself, which often leads to problems with mismatched versions and failed installations. Datahub bypasses all of these by providing an environment you can use online!
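
If you are curious which versions an environment actually provides, a minimal sketch like the one below can report them. It assumes the packages were installed under their usual PyPI names (note `scikit-learn`, not `sklearn`); the helper name `installed_versions` is made up for this example.

```python
from importlib import metadata

# Distribution names as published on PyPI (assumed; sklearn is "scikit-learn").
PACKAGES = ["pandas", "numpy", "scikit-learn", "matplotlib"]

def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

report = installed_versions(PACKAGES)
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Comparing this report between two machines is a quick way to spot the version mismatches mentioned above.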

### Steps to download from Slack and unzip on Datahub
1. Make sure you are in the Slack workspace and in the career exploration committee channel.
2. Download the LectureX.zip file.
3. Open Datahub at http://datahub.berkeley.edu/ and log in with your Berkeley account.
4. Click Upload at the top right.
5. Upload LectureX.zip (X represents the lecture number, for example Lecture1.zip).
6. Select 'New' at the top right of the Datahub screen, and select Terminal from the drop-down.
7. Enter the following command:
   * `unzip LectureX.zip`
8. Open the LectureX folder and open the ipynb file inside it.


We will share files mainly through Slack. Remember to upload the entire zip file to Datahub and unzip it there.
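
The `unzip` step can also be done from a notebook cell with Python's standard `zipfile` module. The sketch below is only an illustration: it builds a tiny stand-in archive in a temporary directory, and the file name `Lecture1/Lecture1.ipynb` inside it is an assumption, not the real contents of the lecture zip.

```python
import tempfile
import zipfile
from pathlib import Path

def unzip(archive_path, destination):
    """Extract every file in archive_path into destination; return the member names."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(destination)
        return archive.namelist()

# Work in a throwaway directory so nothing on Datahub is touched.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    archive_path = tmp / "Lecture1.zip"
    # Build a stand-in archive (the real one comes from Slack).
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("Lecture1/Lecture1.ipynb", "{}")
    extracted = unzip(archive_path, tmp)
    notebook_exists = (tmp / "Lecture1" / "Lecture1.ipynb").exists()

print(extracted, notebook_exists)
```

On Datahub you would call `unzip("LectureX.zip", ".")` instead, with X replaced by the lecture number.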

<a id='section2'></a>
# What will we learn?

This semester we will go over many topics at a relatively high level. We begin by introducing Jupyter notebooks (what you are reading right now!) and use them to teach most of our lectures. Jupyter notebooks are incredibly useful because they let you run separate chunks of code one at a time, without having to run the entire program at once.

We aim to go through the following topics for the semester.

<table class="table table-bordered table-hover table-condensed">
<thead><tr><th title="Field #1">Date</th>
<th title="Field #2">Lecture</th>
<th title="Field #3">Description</th>
</tr></thead>
<tbody><tr>
<td>9/20</td>
<td>L1 Software for Class and Logistics</td>
<td>Jupyter Notebooks, Datahub, Terminal</td>
</tr>
<tr>
<td>9/27</td>
<td>L2 Fundamental Python</td>
<td>Lists/Dictionaries, Functions, Iterations</td>
</tr>
<tr>
<td>10/4</td>
<td>L3 Numpy/Pandas</td>
<td>Numpy Arrays, Pandas, Data Frames</td>
</tr>
<tr>
<td>10/11</td>
<td>L4 EDA and Data Visualization</td>
<td>EDA, Matplotlib, residual plots, box plots, histograms, QQ plots, normal distribution</td>
</tr>
<tr>
<td>10/18</td>
<td>L5 Probability for Data Science</td>
<td>Counting, probability, expectation, variance, discrete distributions, continuous distributions</td>
</tr>
<tr>
<td>10/25</td>
<td>L6 Linear Algebra for Data Science</td>
<td>Matrices, vectors, linear regression, R^2, residual plots, linear models, transposes, inverses, normal equation, column spaces, orthogonality</td>
</tr>
<tr>
<td>11/1</td>
<td>L7 Applications of Math/Stats/Code pt1</td>
<td>Under/overfitting, loss functions, classification vs regression, training/validation/testing, bias, variance</td>
</tr>
<tr>
<td>11/8</td>
<td>L8 Applications of Math/Stats/Code pt2</td>
<td>Ridge regression, k-fold cross validation, lasso, hyperparameters, gradient descent, OLS</td>
</tr>
<tr>
<td>11/15</td>
<td>L9 Classification Part 1</td>
<td>Logistic regression, loss function, model evaluation, decision boundaries</td>
</tr>
<tr>
<td>11/29</td>
<td>L10 Classification Part 2</td>
<td>Decision trees, random forest, k-nearest neighbors, hyperparameter tuning</td>
</tr>
</tbody></table>

As you can see, the semester is packed full of various concepts, from statistical ideas such as bias and variance to machine learning methods like random forests and decision trees.

The semester is structured so that you will be able to accumulate foundational skills, learn more advanced concepts, and apply them to a final Kaggle competition. 

The course material is being written by our lovely Education committee! You will get to meet them over the course of the semester as we are rotating lecturers.

This schedule is quite ambitious and fast paced as it aims to cover a very large amount of material. 

**Please let us know if you ever have feedback, have questions, or you are just looking for some more help! We are all happy to help out. You can always reach us over slack.**

**This material is hard!**

We also hold many workshops and socials over the semester! We hope that you are all able to come participate and have a great time!

<a id='section3'></a>
# Final Project/Kaggle Competition

This semester we will have a final project in the form of a Kaggle competition. It will be a good way to apply what you have learned in lecture to a practical problem!

We will release two notebooks (one for EDA and one for modeling) to help you make progress on your data analysis. The EDA notebook will be released on 10/11, and the modeling notebook will be released on 11/01.

The entire project/competition is due on 12/09, which is the Friday of RRR week. Education committee members will hold office hours (OH) for the project during the week before RRR week.


# Homework
This semester, we will be having **mandatory** homework problem sets to complement each lecture. We expect each homework set to take 1-2 hours and to serve as good independent practice for relevant topics. Solutions will also be released for each homework set.

You will submit your homework solutions to Gradescope. The instructors of each lecture will tell you the format of your submission (such as PDF or notebook).

# Terminal

We have already used the terminal to unzip the course materials. Now let's learn a few more of its basic operations!

Basic commands:

* `cd path/to/directory/`: (Change Directory) change the directory you are currently working in to `path/to/directory/`, so that later commands run there. If you type nothing after `cd`, it goes back to your home directory.
* `ls path/to/directory/`: (List Directory) view the contents (files and directories) of `path/to/directory/`. If you type nothing after `ls`, it lists the contents of the current directory.
* `cp filename1 filename2`: (Copy a file) copy a file from one location (`filename1`) to another location (`filename2`).
* `mv filename1 filename2`: (Move a file) works like `cp`, but the file is moved instead of copied. Moving a file to a new name in the same directory renames it.
* `rm filename`: (Remove a file) delete `filename`. To remove a folder and everything inside it, use `rm -rf foldername`. Be careful: deleted files do not go to a trash folder and cannot be recovered.
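
Each of these commands has a close counterpart in Python's standard library, which can be handy when you want to do the same thing from a notebook cell. The sketch below pairs every call with the shell command it mirrors. The file names are made up for illustration, and everything happens inside a temporary directory.

```python
import os
import shutil
import tempfile
from pathlib import Path

start = Path.cwd()
with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)                              # cd <tmp>
    Path("notes.txt").write_text("hello")      # create a file to work with
    shutil.copy("notes.txt", "copy.txt")       # cp notes.txt copy.txt
    shutil.move("copy.txt", "renamed.txt")     # mv copy.txt renamed.txt
    after_move = sorted(os.listdir("."))       # ls
    os.remove("notes.txt")                     # rm notes.txt
    os.mkdir("folder")                         # mkdir folder
    Path("folder/inner.txt").write_text("x")
    shutil.rmtree("folder")                    # rm -rf folder
    after_remove = sorted(os.listdir("."))     # ls
    os.chdir(start)                            # cd back before the temp dir is deleted

print(after_move)
print(after_remove)
```

The first listing contains `notes.txt` and `renamed.txt`; after the removals only `renamed.txt` is left.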

Basic Symbols:
* `.` represents the current directory.
* `..` represents the parent directory.
* `*` is a wildcard that matches anything. For example, `cp folder1/* folder2` copies all the files in `folder1` to `folder2`, and `rm *.csv` deletes all the csv files in the current directory.

[Reference](https://www.techrepublic.com/article/16-terminal-commands-every-user-should-know/)
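
To see what the `..` and `*` symbols expand to, here is a small sketch using `pathlib` and `shutil`. The folder and file names are placeholders chosen for the example, and it runs entirely inside a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    folder1, folder2 = root / "folder1", root / "folder2"
    folder1.mkdir()
    folder2.mkdir()
    for name in ["a.csv", "b.csv", "notes.txt"]:
        (folder1 / name).write_text("data")

    # `..` : the parent of folder1 is root
    parent_is_root = (folder1 / "..").resolve() == root.resolve()

    # cp folder1/* folder2
    for path in list(folder1.glob("*")):
        shutil.copy(path, folder2)

    # rm *.csv (as if run inside folder2)
    for path in list(folder2.glob("*.csv")):
        path.unlink()

    left_in_folder1 = sorted(p.name for p in folder1.iterdir())
    left_in_folder2 = sorted(p.name for p in folder2.iterdir())

print(parent_is_root, left_in_folder1, left_in_folder2)
```

One difference worth knowing: in the shell, `*` skips names that start with a dot, while `Path.glob("*")` includes them.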

Use the commands above to do the following operations:

1. Go to the `Excerise` directory
2. View the contents in the current directory. What are they?
3. View the contents of the `Berkeley` and `Stanfurd` directories. What are they?
4. Go to the `Berkeley` directory
5. Go back to the `Excerise` directory
6. Remove everything in `Stanfurd`
7. Copy everything in `Berkeley` to `Stanfurd`

Good job! You have turned Stanford into Bear territory!

<a id='section4'></a>
# DataHub/Jupyter Notebook Guide

Datahub will be the place where all your code will reside. In some ways this is your development environment! You should always be familiar with the environment you program in, and here are some exercises to help you get started!

**Double click** a cell to edit the contents. You can run a cell by pressing **ctrl-enter**.

There are two main kinds of cells we expect you to know:

* **Code**: Self-explanatory. This is where your code is stored.
* **Markdown**: All the text stuff. There is also some LaTeX integration! w $\alpha$ o! Just put `$` around your LaTeX code.

The **kernel** is the process that runs your code, and you may find that it crashes from time to time. If the kernel does not work, your code does not work. Look at the top right of the notebook to see the status of the kernel.

As the wise IT guys always say, try turning it off and on again if it doesn't work. To restart the kernel, open the "Kernel" dropdown menu near the top of the notebook and choose a restart option.

The **toolbar** at the top of the notebook also holds some pretty useful tools! 

**Note**: Here is an optional [reading](https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330) on more Jupyter Notebook shortcuts.

<h3>Q1</h3>

Look in the toolbar and try to find the **Run All Above** command.

What tab is it in?

<h3> Q2 </h3>

Delete the cell below

*DELETE ME PLEASE*

<h3> Q3 </h3>

**In the cell below, write a number from 1 to 100, inclusive.**

*Make me a number*

<h3> Q4 </h3>

In the cells below, write your name, major, and a fun fact about yourself. Make sure to hit Save (File > Save and Checkpoint) or Ctrl/Command-S after you've finished writing.

**Name**: 

**Major**: 

**Fun Fact**: 

<h3> Q5 </h3>

Run the cell below to make sure everything runs fine. 

In [None]:
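# True relationship between x and y: y = 4x + 2
func = lambda x: 4*x+2
samples = 100
data_range = [0, 1]

# Draw x uniformly from data_range, then add Gaussian noise (standard deviation 3) to y
x = np.random.uniform(data_range[0], data_range[1], (samples, 1))
y = func(x) + np.random.normal(scale=3, size=(samples, 1))
# Fit a line to the noisy data and predict at the two endpoints of data_range
model = LinearRegression().fit(x, y)
predictions = model.predict(np.array(data_range).reshape(-1, 1))

# Plot the noisy samples, the true line, and the fitted line
fig, ax = plt.subplots(figsize=(12, 8))
plt.scatter(x, y)
plt.plot(data_range, list(map(func, data_range)), label="Truth")
plt.plot(data_range, predictions, label="Prediction")
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Linear regression")
plt.legend()
plt.show()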
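
If the "Prediction" line in your plot does not sit exactly on the "Truth" line, that is expected: the noise is large compared with the spread of the data, so the fitted slope and intercept change from run to run. The sketch below reproduces the fit numerically with NumPy only. Note that it swaps `LinearRegression` for `np.polyfit` and `np.linalg.lstsq`, and the fixed seed is an assumption added so the numbers are repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the run is repeatable
samples = 100
x = rng.uniform(0, 1, size=samples)
y = 4 * x + 2 + rng.normal(scale=3, size=samples)

# Degree-1 polynomial fit: returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)

# The same line from the least-squares solver on the design matrix [x, 1].
design = np.column_stack([x, np.ones(samples)])
(slope_ls, intercept_ls), *_ = np.linalg.lstsq(design, y, rcond=None)

print(f"polyfit: slope={slope:.3f}, intercept={intercept:.3f}")
print(f"lstsq:   slope={slope_ls:.3f}, intercept={intercept_ls:.3f}")
```

Both approaches solve the same least-squares problem, so they agree with each other even though neither recovers the true values 4 and 2 exactly.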

<h3> Q6 </h3>

Run the cell below to display a dataframe.

In [None]:
pd.read_csv('dailyActivity_merged.csv')

Unnamed: 0,Id,ActivityDate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories
0,1503960366,4/12/2016,13162,8.500000,8.500000,0.0,1.88,0.55,6.06,0.00,25,13,328,728,1985
1,1503960366,4/13/2016,10735,6.970000,6.970000,0.0,1.57,0.69,4.71,0.00,21,19,217,776,1797
2,1503960366,4/14/2016,10460,6.740000,6.740000,0.0,2.44,0.40,3.91,0.00,30,11,181,1218,1776
3,1503960366,4/15/2016,9762,6.280000,6.280000,0.0,2.14,1.26,2.83,0.00,29,34,209,726,1745
4,1503960366,4/16/2016,12669,8.160000,8.160000,0.0,2.71,0.41,5.04,0.00,36,10,221,773,1863
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
935,8877689391,5/8/2016,10686,8.110000,8.110000,0.0,1.08,0.20,6.80,0.00,17,4,245,1174,2847
936,8877689391,5/9/2016,20226,18.250000,18.250000,0.0,11.10,0.80,6.24,0.05,73,19,217,1131,3710
937,8877689391,5/10/2016,10733,8.150000,8.150000,0.0,1.35,0.46,6.28,0.00,18,11,224,1187,2832
938,8877689391,5/11/2016,21420,19.559999,19.559999,0.0,13.22,0.41,5.89,0.00,88,12,213,1127,3832
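
The `Unnamed: 0` column in the output above suggests that the file was saved with its row index as an unnamed first column. A minimal sketch of how to inspect a dataframe and handle that column follows. It uses a few rows and columns copied from the output above as a stand-in for the real file, read from a string so the example is self-contained.

```python
import io
import pandas as pd

# Three rows and a subset of columns copied from the output above (stand-in data).
csv_text = """,Id,ActivityDate,TotalSteps,TotalDistance,SedentaryMinutes,Calories
0,1503960366,4/12/2016,13162,8.50,728,1985
1,1503960366,4/13/2016,10735,6.97,776,1797
2,1503960366,4/14/2016,10460,6.74,1218,1776
"""

raw = pd.read_csv(io.StringIO(csv_text))               # keeps an "Unnamed: 0" column
df = pd.read_csv(io.StringIO(csv_text), index_col=0)   # uses that column as the row index

print(raw.columns.tolist())
print(df.shape)                    # (rows, columns)
print(df.head(2))                  # first two rows
print(df["TotalSteps"].mean())     # average of one column
```

With the real file, the equivalent call would be `pd.read_csv('dailyActivity_merged.csv', index_col=0)`.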


<h3> Q7 </h3>
Let's have some fun with markdown.  Create a markdown cell below and make a numbered list of your top 3 favorite places on campus.  Bold the first item, italicize the second, and bold and italicize the third.  Your list should look similar to below.  Try Googling for markdown syntax.  If you need help, feel free to double-click this cell.

1. **Place 1**
2. *Place 2*
3. ***Place 3***

<h3> Q8 </h3>
Now let's try some latex. Create a cell below and try recreating the formula below. Don't worry, you don't need to understand what the formula means (yet). Remember that you can render latex by surrounding the formula with dollar signs. If you need help, you can double-click this cell.

Hint: $\sigma$ is called sigma and $\mu$ is called mu.

$$P(X \leq b) = \int_{-\infty}^{b} \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(t-\mu)^2}{2\sigma^2}} \,dt$$
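
For the curious, the formula can also be checked numerically. The sketch below integrates the normal density, whose exponent is $-\frac{(t-\mu)^2}{2\sigma^2}$, with the trapezoidal rule and compares the result with the closed form based on the error function. The function names, the cutoff of 10 standard deviations, and the number of steps are choices made for this illustration.

```python
import math

def normal_pdf(t, mu, sigma):
    """Integrand of the formula: the normal density at t."""
    return math.exp(-((t - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf_numeric(b, mu=0.0, sigma=1.0, steps=20_000):
    """Approximate P(X <= b) with the trapezoidal rule.

    The lower limit -infinity is replaced by mu - 10*sigma, where the
    density is negligibly small.
    """
    a = mu - 10 * sigma
    if b <= a:
        return 0.0
    h = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(a + i * h, mu, sigma)
    return total * h

def normal_cdf_exact(b, mu=0.0, sigma=1.0):
    """Closed form in terms of the error function."""
    return 0.5 * (1 + math.erf((b - mu) / (sigma * math.sqrt(2))))

approx = normal_cdf_numeric(1.0)
exact = normal_cdf_exact(1.0)
print(approx, exact)
```

The two numbers should agree to several decimal places, which is a quick sanity check that the integral really is a probability between 0 and 1.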

<a id='section5'></a>
# (Optional) JupyterLab Guide

![image](jupyterLab_interface.png)

JupyterLab is the next-generation notebook interface. Its core is still the Jupyter notebook, but JupyterLab adds richer functionality for more efficient and powerful coding.

Here are some highlights of JupyterLab that might sound interesting to you:

1. It extends the functionality of the classic Jupyter notebook. For example, you can move a cell by dragging it with the cursor. You also have access to a **debugger**.
2. It has an extra left sidebar that shows the file management system, Git, etc.
3. It has more customization to optimize your coding experience (e.g. splitting the screen, changing the font and interface color)

With JupyterLab, you can code whenever and wherever you want!

Moreover, JupyterLab is the main (or even the only) notebook interface on many cloud platforms such as Amazon Web Services (AWS), Lambda Cloud, and Saturn Cloud.

To enter the JupyterLab interface in your Berkeley DataHub, change the word `notebooks` to `lab` in the URL. For example, you should change `https://datahub.berkeley.edu/user/wenhao1102/notebooks/Untitled.ipynb` to `https://datahub.berkeley.edu/user/wenhao1102/lab/Untitled.ipynb`.
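
The URL change is a plain text substitution, which the short sketch below spells out. The helper `to_lab_url` is hypothetical and written only for this example; it is not part of DataHub or Jupyter.

```python
def to_lab_url(notebook_url):
    """Swap the first '/notebooks/' path segment for '/lab/'.

    Hypothetical helper for illustration only.
    """
    return notebook_url.replace("/notebooks/", "/lab/", 1)

classic = "https://datahub.berkeley.edu/user/wenhao1102/notebooks/Untitled.ipynb"
lab = to_lab_url(classic)
print(lab)
```

Replacing only the first match, and including the surrounding slashes, avoids touching a file or folder that happens to have `notebooks` in its name.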

# Fun Fact

The father of the Jupyter Notebook, [Fernando Pérez](https://statistics.berkeley.edu/people/fernando-perez), is a professor in our Statistics department! He has taught DATA 100 and STAT 159.