# Linux

If you already have Linux and are comfortable using it, skip to the Anaconda section.

## Installation

For the purposes of this workshop series we request that everyone have a working Linux distribution. If you don't have one at the moment, you have two options: you may either dual-boot or set up a virtual machine running Linux.

### Virtual Machines

Creating a virtual machine is comparatively easier than dual-booting, with less risk of data loss. 

First, install VirtualBox: https://www.virtualbox.org/wiki/Downloads.

Then, you'll need to download the Ubuntu ISO file: https://www.ubuntu.com/download/desktop. 

From here, run VirtualBox and hit New; follow the prompts for creating a new virtual machine using the ISO you downloaded. We recommend allocating as much RAM and as many CPU cores as possible to the virtual machine, since parallel computing is computationally expensive.

### Dual-Booting on Windows

To dual-boot, we will provide USB drives to use to install Ubuntu.

Prior to this, users are required to open disk manager and shrink one of their partitions, freeing up space into which Ubuntu can be installed.
**Backing Up Your Computer Prior To This Step Is Recommended**

Next, restart your computer. As it's booting up, hold F2 to access your BIOS settings. Here, you're going to want to enable booting from a USB. If you're having trouble with the installation later, we may also ask that Windows 8/10 users disable secure boot.

As for the installation itself, please ask a CDS staff member for a USB with live boot installed.

Please, before we begin, open your console and run the command: `sudo apt install ssh`

## A Quick Crash Course to Common Shell Commands

Here, I'll be providing a quick introduction to a number of common Linux shell commands, just to get you started.

For those looking for more depth, I have provided a link to a fairly comprehensive list [here](https://ss64.com/bash/), and a book on the topic, which also covers bash scripting, [here](http://linuxcommand.org/tlcl.php).

Should you ever want to know more about a specific command, one of the best and most immediate resources is to run `man <command>`, which will open the manual pages for that command.

Now: on to the actual crash course.

First, run the following command to download and install the provided example.

In [None]:
%%bash
wget CDS-Linux-Example.tar.gz
tar xzf CDS-Linux-Example.tar.gz
rm CDS-Linux-Example.tar.gz
cd example

First, a small chat about the Linux filesystem. In Linux, everything in the file system falls into one of two categories: it is either a *file*, or it is a *directory*.

A file is just a reserved block of disk space with a name, while a directory just has a bunch of pointers to other entities in the filesystem (namely, other files and directories).
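
As a quick illustration of the file/directory distinction, the sketch below builds a throwaway directory and asks the shell what each entry is. The names `notes.txt` and `photos` are made up for the example and are not part of the workshop files.

```shell
# Work in a throwaway directory so nothing in your home directory is touched
scratch=$(mktemp -d)
cd "$scratch"

touch notes.txt     # create an empty file (hypothetical name)
mkdir photos        # create an empty directory (hypothetical name)

# [ -f NAME ] is true for a regular file, [ -d NAME ] is true for a directory
for entry in notes.txt photos; do
    if [ -f "$entry" ]; then
        echo "$entry is a file"
    elif [ -d "$entry" ]; then
        echo "$entry is a directory"
    fi
done
```

The same `-f` and `-d` tests are handy in scripts whenever a step should only run if a file or directory already exists.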

The Linux file system begins with the root node, referred to as /, and it branches out from there. While we can traverse away from there, it may be useful to know where the heck we are:

In [4]:
%%bash
# The above causes each line of this cell to be run with bash

# First, let's see where we are exactly. Type pwd (Print Working Directory) to get the location
pwd

/home/cai29/Cornell/CDS/spark_ws


Cool, we know where we are. But what's around us? We're in the dark at the moment:

In [None]:
%%bash
# The ls (list) command to see what entities are in the current directory
ls

Better, but the output for that is kinda vague. Let's make it clearer by adding a flag. A flag is an option which may be passed to a program to change its settings or to make it perform certain actions at runtime.

In [None]:
%%bash
# The -l flag tells ls to print the contents of the current directory with longer, more detailed output
ls -l

Cool! Now we know what's around us. We can see a number of files and directories. (There are also two strange directories, "." and "..", which only show up with a flag we'll meet shortly. We'll get to these later.)

Let's see if we can't go somewhere. The totally_normal_directory seems a good place to start:

In [None]:
%%bash
# The cd (change directory) command moves into the specified directory
cd totally_normal_directory

Alright! We're in. Let's take a look around.

In [None]:
%%bash
ls -l

It's...empty. Well, that's anticlimactic. But is it actually? 

You see, directories can get really cluttered, really fast, so Linux lets you mark a file or directory as hidden by putting a period in front of its name. Let's check again to see if there are any hidden files.

In [None]:
%%bash
# The -a flag tells ls to include hidden files and directories. Notice, we can use multiple flags!
ls -la
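
To see the effect of a leading dot in isolation, here is a small sketch that runs in a scratch directory. The file names are invented for the demonstration.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch visible.txt .secret.txt   # hypothetical names; the leading dot hides the second one

plain=$(ls)        # hidden entries are left out
with_all=$(ls -a)  # -a also lists ., .. and any names starting with a dot

echo "ls    shows: $plain"
echo "ls -a shows: $(echo $with_all)"

# -l -a and -la are two spellings of the same request
if [ "$(ls -l -a)" = "$(ls -la)" ]; then
    echo "separate and combined flags give the same listing"
fi
```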

That's strange, what's...OH. UHH. UHHHHHHHH. Abort. Let's Just. Get Rid of This.

In [None]:
%%bash
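# The rm command deletes a file. Adding the -r flag allows you to delete directories and their contents.
# * is known as a wildcard, and expands to every non-hidden name in the current directory.
# Hidden names need their own pattern: .[!.]* matches names starting with a dot, but not . or ..
rm -r .[!.]*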
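
Because `rm` is unforgiving, it can help to preview what a wildcard expands to before deleting anything. The sketch below assumes bash's default settings (the `dotglob` option off) and uses made-up file names in a scratch directory; `echo` stands in for `rm` so nothing is removed.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch report.txt .hidden_notes   # hypothetical names

# echo prints what a pattern expands to, without deleting anything
star=$(echo *)         # non-hidden names only
dots=$(echo .[!.]*)    # names starting with a dot, excluding . and ..

echo "*      expands to: $star"
echo ".[!.]* expands to: $dots"
```

Swapping `echo` for `rm -r` only after the preview looks right is a cheap safety habit.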

Aight, let's get out of here before we find anything else. So, where can we go?

In [None]:
%%bash
ls -la

All that's left are these . and .. directories... I suppose we'll have to try those.

In [None]:
%%bash
cd .

Nothing. Makes sense though. The . directory is just shorthand for the current working directory. Let's try .. next.

In [None]:
%%bash
cd ..

Success! We managed to go back a directory. This is because .. is shorthand for the directory one level up from the current working directory. Next, let's see what we can do about this other file. First, let's see what's inside it.

In [None]:
%%bash
# The cat command takes in text and then outputs it right back out - NOTE: if no filename is specified, cat will
# take input from standard input.
cat nonsense.txt

.... It truly lives up to its name. You may be curious how many random words I put in that file, but it would be a waste of all of our time to just count them by hand.

In [None]:
%%bash
# The wc command outputs the line, word, and byte counts of a file. Adding the -l flag will output only the number of lines.
wc nonsense.txt
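
For a file small enough to check by eye, the following sketch shows how the `-l` and `-w` flags narrow the output of `wc`. The file `sample.txt` and its contents are invented for the example.

```shell
scratch=$(mktemp -d)
cd "$scratch"
# A hypothetical three-line file with five words in total
printf 'one two\nthree\nfour five\n' > sample.txt

lines=$(wc -l < sample.txt)   # reading from standard input keeps the filename out of the output
words=$(wc -w < sample.txt)

echo "lines: $lines, words: $words"
```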

Aight, enough nonsense. Let's get something better up in here. First, let's make somewhere to put it.

In [None]:
%%bash
# The mkdir command creates a new directory
mkdir new_directory
cd new_directory

Cool! Well, let's get something downloaded here...how about War of the Worlds? That's a good book.

In [None]:
%%bash 
# The wget command takes a url and downloads the contents at that location.
wget http://www.gutenberg.org/cache/epub/36/pg36.txt

Alright, literature! Hmmm, that might take a little while to download. In the meantime, what say we check out how our computers are handling it? Open a new console and enter the following command:

In [None]:
%%bash
# The top command displays the processes taking up resources on your machine, sorted by CPU usage by default
top

Let's turn our attention back to the literature, shall we? So we just downloaded War of the Worlds to the directory new_directory. That's not a great name, so let's change it.

In [None]:
%%bash
cd ..
# The mv command can be used to rename a file or directory
mv new_directory literature

Nice, that's a bit better. But, we still have nonsense.txt sitting around here...I guess it counts as literature. It has words, at least. Let's move it into the literature directory.

In [None]:
%%bash
# The mv command is also used to move files and directories into other directories
mv nonsense.txt literature
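
Whether `mv` renames or moves depends on whether the destination already exists as a directory. A minimal sketch in a scratch directory, with made-up names, shows both behaviours side by side.

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir drafts                # hypothetical directory
echo "hello" > memo.txt     # hypothetical file

mv drafts archive           # "archive" does not exist yet, so this renames drafts
mv memo.txt archive         # "archive" is now a directory, so this moves memo.txt into it

ls archive
```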

And that about does it for the crash course! These are a very small subset of possible Linux commands, but they should give you a decent idea of what's going on during the rest of the installation process.

# Anaconda

Anaconda is a prepackaged Python ecosystem geared towards data science. Anaconda and its supporting products are supplied by Continuum Analytics. Anaconda comes pre-built with a wide variety of packages; a full list can be found [here](https://docs.continuum.io/anaconda/packages/pkg-docs).

## Installation

The installation of Anaconda is fairly simple. First, download the install package supplied by Continuum Analytics and give it execute privileges:

In [None]:
%%bash
cd ~/Downloads
wget https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86_64.sh
chmod u+x Anaconda*.sh

Then, run the provided install script; this will walk you through the installation and configuration of Anaconda (for configuration, the defaults will usually be fine):

In [None]:
%%bash
./Anaconda*.sh
rm Anaconda*.sh

## Virtual Environments and Package Installation with Conda and Pip

Anaconda comes with not one, but two separate package managers: the traditional Python package manager pip, and its own package manager, Conda. Both can be used to install additional Python packages from their respective repositories.
 
It is often quite useful to be able to manage separate installations of Python and its packages, for example when different versions of the same package are required by different projects. This is accomplished through virtual environments: separate, selectable installations of Python and its packages. These can be created and managed using conda, or using pip and the virtualenv package.

The following cheatsheet should be a sufficient primer on how to use Conda and Pip for package installation and virtual environment management.

https://conda.io/docs/_downloads/conda-cheatsheet.pdf
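
As a rough sketch of what environment management looks like in practice, the commands below create a Conda environment and list the environments that exist. The environment name `cds-workshop` and the choice of `numpy` are assumptions made for illustration; the block does nothing but print a message if `conda` is not yet on your PATH.

```shell
# Hypothetical environment name, chosen only for this example
ENV_NAME="cds-workshop"

if command -v conda >/dev/null 2>&1; then
    # Create an isolated environment containing numpy (needs network access)
    conda create --yes --name "$ENV_NAME" numpy
    # List all environments conda knows about
    conda env list
    # To switch into it, older conda releases use:  source activate cds-workshop
    # and newer releases use:                       conda activate cds-workshop
else
    echo "conda not found; install Anaconda first"
fi
```

Inside an activated environment, `pip install <package>` installs into that environment rather than into the base installation.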

# Hadoop

Hadoop is a framework built around a distributed filesystem, which allows users to store large data sets across the machines of a cluster while maintaining integrity in the face of failure. It includes HDFS, which provides access to application data, and YARN, which is a framework for job scheduling and resource management.

## Installation

Begin by creating a hadoop group, and add your existing user to it. Belonging to a group grants your user that group's permissions on files owned by the group.

In [None]:
%%bash
sudo addgroup hadoop 
sudo usermod -a -G hadoop $USER
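
New group memberships only take effect in sessions started after the change, so it is worth checking what the current session sees. The sketch below is one way to do that; it only reads information and changes nothing.

```shell
me=$(id -un)          # current user name; works even when $USER is unset
my_groups=$(id -nG)   # space-separated names of every group this session belongs to

echo "$me belongs to: $my_groups"

# "hadoop" only shows up here after the addgroup/usermod step and a fresh login
case " $my_groups " in
    *" hadoop "*) echo "hadoop group is active in this session" ;;
    *)            echo "hadoop group not active yet; log out and back in, then re-check" ;;
esac
```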

Then, generate SSH keys to be used for verification with the system. Copy this key to your user on localhost.

In [None]:
%%bash
ssh-keygen; ssh-copy-id $USER@localhost

Now that setup is done, it's time to actually install Hadoop. First, move into /usr/local. Then, use wget to acquire the tar archive for Hadoop. Extract Hadoop from the archive, remove the archive, and give your user and the hadoop group ownership of the extracted directory.

This Hadoop directory will have a rather cumbersome name. Create a symbolic link to it, simply named hadoop, in /usr/local.

In [None]:
%%bash
cd /usr/local
sudo wget http://mirror.reverse.net/pub/apache/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz
sudo tar xzf *.gz
sudo rm *.gz
sudo chown -R $USER:hadoop hadoop*
sudo ln -s hadoop* hadoop 
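
To see what the symbolic link does without touching /usr/local, the sketch below recreates the same layout in a scratch directory. The empty `hadoop-2.8.1` directory and its `README.txt` are stand-ins for the real unpacked release.

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir hadoop-2.8.1               # stands in for the unpacked release directory
touch hadoop-2.8.1/README.txt

ln -s hadoop-2.8.1 hadoop        # "hadoop" now points at the versioned directory

target=$(readlink hadoop)        # readlink prints where the link points
echo "hadoop -> $target"
ls hadoop/                       # listing through the link shows the real contents
```

Scripts and configuration can then refer to the short `hadoop` path, and a future upgrade only needs the link to be re-pointed.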

## Configuration

CDS supplies a number of small files of configuration changes to make to your Hadoop and Spark setups. Download the tar file with wget and unpack it.

In [None]:
%%bash
cd hadoop
wget <link>
tar xzf CDS-Hadoop-config-files.tar.gz
rm CDS-Hadoop-config-files.tar.gz

First of all, append the Hadoop-bashrc-snippet file to your .bashrc file. .bashrc is a configuration file used by your bash shell.

In [None]:
%%bash
cat Hadoop-bashrc-snippet.txt >> ~/.bashrc
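
Running an append like this twice leaves two copies of the snippet in the target file. One possible guard, sketched below on stand-in files rather than your real `~/.bashrc`, is to append only when the snippet's first line is not already present. The helper name `append_once` and the file contents are hypothetical.

```shell
scratch=$(mktemp -d)
cd "$scratch"
# Stand-ins for ~/.bashrc and the snippet file; contents are made up for the demo
printf '# existing settings\n' > fake_bashrc
printf 'export DEMO_HOME=/opt/demo\n' > snippet.txt

append_once() {
    # Append file $1 to file $2 only if the snippet's first line is not already there
    if ! grep -qxF -- "$(head -n 1 "$1")" "$2"; then
        cat "$1" >> "$2"
    fi
}

append_once snippet.txt fake_bashrc
append_once snippet.txt fake_bashrc   # second call finds the line and does nothing

cat fake_bashrc
```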

Next, we need an actual place to keep the Hadoop file system's data. We will create a hadoop directory under the system /var/lib directory for this purpose. Then, change the ownership of this directory to our user.

In [None]:
%%bash
sudo mkdir -p /var/lib/hadoop
sudo chown -R $USER:hadoop /var/lib/hadoop

There are two small changes we want to make manually to the config files. First, change the variable HADOOP_OPTS to prefer IPv4, which effectively disables IPv6.
Second, replace the value "${JAVA_HOME}" with the location of your chosen JDK directory.

In [None]:
%%bash
sed -i 's/^export HADOOP_OPTS=.*$/export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true/' etc/hadoop/hadoop-env.sh
sed -i 's+^export JAVA_HOME=.*$+export JAVA_HOME=/usr/lib/jvm/java-8-oracle+' etc/hadoop/hadoop-env.sh
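
If you would like to see what these substitutions do before editing the real file, the sketch below applies the same two expressions to a two-line stand-in and writes the result to a new file. The file `demo-env.sh` and its contents are invented for the demonstration; the real hadoop-env.sh is much longer.

```shell
scratch=$(mktemp -d)
cd "$scratch"
# A two-line stand-in for hadoop-env.sh
printf 'export JAVA_HOME=${JAVA_HOME}\nexport HADOOP_OPTS="$HADOOP_OPTS -Dexample=1"\n' > demo-env.sh

# Using + as the delimiter in the first expression avoids escaping the slashes in the path
sed -e 's+^export JAVA_HOME=.*$+export JAVA_HOME=/usr/lib/jvm/java-8-oracle+' \
    -e 's/^export HADOOP_OPTS=.*$/export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true/' \
    demo-env.sh > demo-env.edited.sh

cat demo-env.edited.sh
```

Without `-i`, sed leaves the input file untouched, which makes this a safe way to rehearse an edit.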

Then, write each snippet into its respective XML file, replacing the default (empty) configuration.

In [None]:
%%bash
cat Hadoop-core-snippet.txt > etc/hadoop/core-site.xml
cat Hadoop-hdfs-snippet.txt > etc/hadoop/hdfs-site.xml
cat Hadoop-yarn-snippet.txt > etc/hadoop/yarn-site.xml
rm *-snippet.txt
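
The difference between the `>` and `>>` redirections matters when writing configuration files: one replaces a file's contents and the other adds to them. A small sketch on scratch files (names made up) makes the distinction concrete.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'old line\n' > overwrite.txt
printf 'old line\n' > append.txt

echo 'new line' >  overwrite.txt   # > truncates the file first, so "old line" is gone
echo 'new line' >> append.txt      # >> keeps the existing content and adds to the end

echo "overwrite.txt has $(wc -l < overwrite.txt) line(s)"
echo "append.txt has $(wc -l < append.txt) line(s)"
```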

Please open a new terminal or reload your bashrc script. Finally, we will format the HDFS NameNode and start the HDFS and YARN daemons. Check your results with the jps command.

In [None]:
%%bash
source ~/.bashrc
hdfs namenode -format
start-dfs.sh
start-yarn.sh
jps

The output of this should be something like the following:

9648 Jps

8260 ResourceManager

8389 NodeManager

9147 DataNode

8989 NameNode

9342 SecondaryNameNode
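
Rather than scanning the list by eye, you could check for each expected daemon by name. The sketch below runs against the sample output shown above so that it works anywhere; on a real machine you would set `jps_output=$(jps)` instead.

```shell
# Sample jps output copied from the listing above; on a real machine use: jps_output=$(jps)
jps_output='9648 Jps
8260 ResourceManager
8389 NodeManager
9147 DataNode
8989 NameNode
9342 SecondaryNameNode'

missing=""
for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # -w matches whole words, so "NameNode" is not satisfied by "SecondaryNameNode" alone
    if ! echo "$jps_output" | grep -qw "$daemon"; then
        missing="$missing $daemon"
    fi
done

if [ -z "$missing" ]; then
    echo "all Hadoop daemons are running"
else
    echo "not running:$missing"
fi
```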

# Spark

Spark is a general-purpose cluster computing system which provides APIs for several high-level programming languages, including Python. In addition, it supports higher-level tools including Spark SQL, MLlib, GraphX, and Spark Streaming.

## Installation

As with the Hadoop installation, we want to download and unpack the provided tar archive, then remove it. Again, we create a symbolic link to the resulting directory which is simply named spark.

In [None]:
%%bash
cd /usr/local
sudo wget https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
sudo tar xzf *.tgz
sudo rm *.tgz
sudo chown -R $USER spark*
sudo ln -s spark* spark

## Configuration

First, check whether the spark-env.sh file exists in the spark/conf directory. If it does not, create it by copying the spark-env.sh.template file in spark/conf.

Append the contents of Spark-conf-snippet.txt to the spark-env.sh file.

In [None]:
%%bash
cd spark
bash -c 'if [ ! -f conf/spark-env.sh ]; then cp conf/spark-env.sh.template conf/spark-env.sh; fi'

wget <link>
tar xzf CDS-Spark-config-files.tar.gz
rm CDS-Spark-config-files.tar.gz
cat Spark-conf-snippet.txt >> conf/spark-env.sh

Finally, append the contents of Spark-bashrc-snippet.txt to your .bashrc file.

In [None]:
%%bash
cat Spark-bashrc-snippet.txt >> ~/.bashrc

Test to see if everything is set up properly by running: 

In [None]:
%%bash
source ~/.bashrc
pyspark --master yarn --deploy-mode client

You should check that:

1. A Jupyter notebook app pops up.
2. When you open a new notebook after opening Jupyter notebook in your browser, the following screen appears.
3. Going to localhost:50070 in your browser confirms that there is an app running.


And you're done! You've installed Anaconda, Hadoop and Spark.