# Linux

## Installation

## Setup

Before we begin, please run the command: _sudo apt install ssh default-jdk_

## A Quick Crash Course to Common Shell Commands

Here, I'll give a quick introduction to a number of common Linux shell commands, just to get you started.

For those looking for more depth, I have provided a link to a fairly comprehensive list [here](https://ss64.com/bash/), and a book on the topic, which also covers bash scripting, [here](http://linuxcommand.org/tlcl.php).

Should you ever want to know more about a specific command, one of the best and most immediate resources is to run `man <command>`, which will open the manual page for that command.

Now: on to the actual crash course.

First, run the following commands to download and unpack the provided example.

In [None]:
%%bash
wget CDS-Linux-Example.tar.gz
tar xzf CDS-Linux-Example.tar.gz
rm CDS-Linux-Example.tar.gz
cd example
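
The letters after `tar` are easy to mistype, so here is a small sketch of what they mean. It builds its own throwaway archive in a temporary directory (the file names are made up for illustration), so it does not depend on the course download.

```shell
# Scratch directory so nothing in your real files is touched.
demo=$(mktemp -d)
mkdir "$demo/example"
echo "sample data" > "$demo/example/data.txt"

# c = create, z = compress with gzip, f = use the named archive file.
# -C makes tar work from inside $demo, so the archive stores "example/...".
tar czf "$demo/example.tar.gz" -C "$demo" example

# t = list the contents without extracting anything.
tar tzf "$demo/example.tar.gz"

# x = extract; -C picks the directory to unpack into.
mkdir "$demo/unpacked"
tar xzf "$demo/example.tar.gz" -C "$demo/unpacked"
cat "$demo/unpacked/example/data.txt"
```

Listing with `t` before extracting with `x` is a cheap way to see where an archive will put its files.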

First, a small chat about the Linux filesystem. In Linux, everything in the filesystem falls into one of two categories: it is either a *file* or a *directory*.

Put simply, a file is just a named, reserved section of hard drive space, while a directory holds a bunch of pointers to other entities in the filesystem (namely, other files and directories).
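
To make the distinction concrete, here is a minimal sketch that creates one of each in a scratch directory and asks the shell which is which. The names `projects`, `notes.txt`, and the helper `describe` are invented for this example.

```shell
# Scratch directory so nothing in your real files is touched.
demo=$(mktemp -d)
mkdir "$demo/projects"               # a directory
echo "hello" > "$demo/notes.txt"     # a file

describe() {
    # [ -d path ] succeeds for directories, [ -f path ] for regular files.
    if [ -d "$1" ]; then
        echo "directory"
    elif [ -f "$1" ]; then
        echo "file"
    fi
}

echo "projects:  $(describe "$demo/projects")"
echo "notes.txt: $(describe "$demo/notes.txt")"
```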

The Linux filesystem begins at the root node (`/`) and branches out from there. Before we start traversing it, it may be useful to know where the heck we are:

In [4]:
%%bash
# The above causes each line of this cell to be run with bash

# First, let's see where exactly we are. Type pwd (Print Working Directory) to get the location
pwd

/home/cai29/Cornell/CDS/spark_ws


Cool, we know where we are. But what's around us? We're in the dark at the moment:

In [None]:
%%bash
# The ls (list) command to see what entities are in the current directory
ls

Better, but that output is kinda vague. Let's make it clearer by adding a flag. A flag is an option passed to a program to change its behavior at runtime.

In [None]:
%%bash
# ls -l prints the contents of the current directory in a longer, more detailed format;
# adding -a also lists hidden entries, including "." and ".."
ls -la

Cool! Now we know what's around us. We can see a number of files and directories (including two strange directories, "." and ".."; we'll get to these later).

Let's see if we can't go somewhere. The totally_normal_directory seems a good place to start:

In [None]:
%%bash
# The cd (change directory) command moves into the specified directory
cd totally_normal_directory
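
As for the two strange entries "." and "..": "." refers to the current directory and ".." to its parent. The sketch below shows both with `cd`, using a scratch directory tree (`outer/inner` is a made-up layout) rather than the course example.

```shell
# Build a small tree to walk around in.
demo=$(mktemp -d)
mkdir -p "$demo/outer/inner"

cd "$demo/outer/inner"
pwd                 # path ends in /outer/inner

cd ..               # ".." is the parent directory
pwd                 # path ends in /outer

cd ./inner          # "." is the current directory, so this equals: cd inner
pwd                 # path ends in /outer/inner
```

Paths that start with "." or ".." are relative to where you currently are, while paths that start with "/" are absolute and mean the same thing from anywhere.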

# Anaconda

Anaconda is a prepackaged Python ecosystem geared towards data science. Anaconda and its supporting products are supplied by Continuum Analytics. Anaconda comes pre-built with a wide variety of packages; a full list can be found [here](https://docs.continuum.io/anaconda/packages/pkg-docs).

## Installation

The installation of Anaconda is fairly simple. First, download the install package supplied by Continuum Analytics and give it execute privileges:

In [None]:
%%bash
cd ~
wget https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86_64.sh
chmod u+x Anaconda*.sh
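
If "execute privileges" is new to you, the following sketch may help. It writes a tiny script into a scratch directory (the script and its name `hello.sh` are made up for illustration), shows that it cannot be run directly at first, and then grants execute permission to the owner only.

```shell
demo=$(mktemp -d)
script="$demo/hello.sh"
printf '#!/bin/bash\necho "hello from the script"\n' > "$script"

# A freshly created file normally has no execute permission.
[ -x "$script" ] && echo "executable" || echo "not executable yet"

# u+x adds execute permission for the file's owner and leaves
# everyone else's permissions unchanged.
chmod u+x "$script"
ls -l "$script"

# Now the file can be run directly by its path.
"$script"
```

The symbolic form `u+x` changes only the one permission you name, which is usually safer than a numeric mode that overwrites all of them at once.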

Then, run the provided install script; this will walk you through the installation and configuration of Anaconda (for configuration, the defaults will usually be fine):

In [None]:
%%bash
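# Each %%bash cell starts in the notebook's directory, so point at the installer in ~ explicitly
xterm -e bash -c "bash ~/Anaconda*.sh"
rm ~/Anaconda*.sh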

## Package Installations with Conda and Pip

For this section, I think just linking the cheat sheet might be better than anything I could possibly write in a comparable space (I will likely be using this cheat sheet myself, now that I've found it): 
https://conda.io/docs/_downloads/conda-cheatsheet.pdf

## Virtual Environments with Conda and Pip

As above, maybe just a description of what a virtual env _is_? The cheat sheet seems sufficient, at least as far as usage goes, especially for this course.

# Hadoop

Hadoop is a framework for distributed storage and processing which allows users to store large data sets across the machines of a cluster, while maintaining integrity in the face of failure. It includes HDFS, a distributed file system which provides access to application data, and YARN, a framework for job scheduling and resource management.

## Installation

Begin by creating a hadoop group and adding your existing user to it. The groups your user belongs to grant you certain permissions on files belonging to those groups.

In [None]:
%%bash
xterm -e bash \
    -c "sudo addgroup hadoop; sudo usermod -a -G hadoop $USER" 
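
Group changes generally only show up in sessions started after the change, so it can be worth checking what your current shell actually sees. This is a small sketch of such a check; the messages are my own wording, not output from any tool.

```shell
# id -nG with no user argument prints the group names of the current session.
groups_list=$(id -nG)
echo "current session groups: $groups_list"

# Pad with spaces so "hadoop" only matches as a whole name.
case " $groups_list " in
    *" hadoop "*) echo "hadoop group is active in this session" ;;
    *)            echo "hadoop not active yet; try logging out and back in" ;;
esac
```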

Then, generate an SSH key pair to be used for authentication with the system. Copy the public key to your user on localhost.

In [None]:
%%bash
xterm -hold -e bash \
    -c "ssh-keygen; ssh-copy-id $USER@localhost"

Now that setup is done, it's time to actually install Hadoop. First, create and move into a directory named CDS, located in your home directory. Then, use wget to download the tar archive for Hadoop. Extract Hadoop from the archive and remove the archive.

This hadoop directory will have a rather cumbersome name. Create a symbolic link to it, simply named hadoop, in CDS.

In [None]:
%%bash
mkdir CDS
cd CDS
wget http://apache.osuosl.org/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz
tar xzf *.gz
rm *.gz
ln -s hadoop-2.8.0 hadoop
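
The argument order of `ln -s` is easy to get backwards: the existing target comes first and the new link name second. Here is a sketch in a scratch directory, where an empty `hadoop-2.8.0` directory stands in for the unpacked release.

```shell
demo=$(mktemp -d)
cd "$demo"
mkdir hadoop-2.8.0            # stand-in for the unpacked release directory

# ln -s TARGET LINK_NAME
ln -s hadoop-2.8.0 hadoop

ls -l hadoop                  # the listing ends with: hadoop -> hadoop-2.8.0
readlink hadoop               # prints the target the link points to
```

Because the link is just a name pointing at the versioned directory, a later upgrade only needs the link to be re-pointed; paths that go through `hadoop` stay the same.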

## Configuration

CDS supplies a number of small files containing configuration changes to make to your Hadoop and Spark setups. Download the tar file with wget and unpack it.

In [None]:
%%bash
wget <link>
tar xzf CDS-config-files.tar.gz
rm CDS-config-files.tar.gz

First of all, append the Hadoop-bashrc-snippet.txt file to your .bashrc file. Your .bashrc is a configuration file that bash reads when it starts an interactive shell.

In [None]:
%%bash
cat Hadoop-bashrc-snippet.txt >> ~/.bashrc
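
One thing to watch with `>>` is that running the cell twice appends the snippet twice. A guarded append is one way around that. The sketch below works on stand-in files in a scratch directory; the snippet contents and the helper name `append_once` are invented for illustration.

```shell
demo=$(mktemp -d)
rc="$demo/bashrc"                  # stand-in for ~/.bashrc
snippet="$demo/snippet.txt"        # stand-in for Hadoop-bashrc-snippet.txt
echo '# existing settings' > "$rc"
echo 'export EXAMPLE_HOME=/opt/example' > "$snippet"   # made-up contents

append_once() {
    # Append file $1 to file $2 unless the first line of $1
    # already appears as a whole line in $2.
    local marker
    marker=$(head -n 1 "$1")
    if ! grep -qxF -- "$marker" "$2"; then
        cat "$1" >> "$2"
    fi
}

append_once "$snippet" "$rc"
append_once "$snippet" "$rc"       # the second call changes nothing
cat "$rc"
```

The check only looks at the snippet's first line, which is enough for a sketch but would miss a snippet that was edited after being appended.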

Next, we need an actual place to hold the Hadoop file system. We will create a hadoop directory under the system /var/lib directory for this purpose. Then, change the ownership of this directory to our user.

In [None]:
%%bash
xterm -e bash \
    -c "sudo mkdir -p /var/lib/hadoop; sudo chown -R $USER:hadoop /var/lib/hadoop"

There are two small changes we want to make directly to the hadoop-env.sh config file. First, change the variable HADOOP_OPTS so that IPv4 is preferred, which in effect disables IPv6.
Second, replace the code segment "${JAVA_HOME}" with the location of your chosen JDK directory.

In [None]:
%%bash
sed -i 's|^.*export HADOOP_OPTS=.*$|export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true|' \
        hadoop/etc/hadoop/hadoop-env.sh
sed -i 's|\${JAVA_HOME}|/usr/lib/jvm/default-java|' \
        hadoop/etc/hadoop/hadoop-env.sh
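
If you would like to see what these substitutions do before touching the real file, you can try them on a copy. The sketch below uses a two-line stand-in written in the style of hadoop-env.sh (the exact contents of your file may differ) and assumes GNU sed, which is what Linux ships.

```shell
demo=$(mktemp -d)
env_file="$demo/hadoop-env.sh"
# Two lines in the style of a stock hadoop-env.sh (assumed for illustration).
cat > "$env_file" <<'EOF'
export JAVA_HOME=${JAVA_HOME}
export HADOOP_OPTS="$HADOOP_OPTS"
EOF

# Preview: without -i, sed prints the edited text and leaves the file alone.
sed 's|\${JAVA_HOME}|/usr/lib/jvm/default-java|' "$env_file"

# Apply: -i edits the file in place. Using | as the delimiter avoids
# having to escape the slashes in the path.
sed -i 's|\${JAVA_HOME}|/usr/lib/jvm/default-java|' "$env_file"
sed -i 's|^.*export HADOOP_OPTS=.*$|export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true|' "$env_file"
cat "$env_file"
```

The single quotes matter: they stop the shell from expanding `${JAVA_HOME}` itself, so sed receives the literal text to search for.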

Then, append the following snippet files to their respective XML configuration files.

In [None]:
%%bash
cat Hadoop-core-snippet.txt >> hadoop/etc/hadoop/core-site.xml
cat Hadoop-hdfs-snippet.txt >> hadoop/etc/hadoop/hdfs-site.xml
cat Hadoop-yarn-snippet.txt >> hadoop/etc/hadoop/yarn-site.xml

Finally, we will format our HDFS directory and start the Hadoop daemons. Check your results with the jps command.

In [None]:
%%bash
hdfs namenode -format
start-all.sh
jps

The output of this should be something like the following:

9648 Jps

8260 ResourceManager

8389 NodeManager

9147 DataNode

8989 NameNode

9342 SecondaryNameNode
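
If you would rather not compare that listing by eye, a small helper can report which daemons are absent. This is only a sketch: the function name `missing_daemons` is made up, and the sample input is the listing above with DataNode deliberately removed.

```shell
# Daemons this walkthrough expects to see in the jps listing.
expected="NameNode DataNode SecondaryNameNode ResourceManager NodeManager"

missing_daemons() {
    # Reads jps-style output ("<pid> <name>" per line) on stdin and
    # prints every expected daemon that does not appear in it.
    local names daemon
    names=$(awk '{print $2}')
    for daemon in $expected; do
        echo "$names" | grep -qx "$daemon" || echo "$daemon"
    done
}

# Sample listing with DataNode left out on purpose.
sample="9648 Jps
8260 ResourceManager
8389 NodeManager
8989 NameNode
9342 SecondaryNameNode"

echo "$sample" | missing_daemons      # prints: DataNode
```

On a real setup you would pipe the live listing in instead, as in `jps | missing_daemons`; no output means every expected daemon was found.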

# Spark

Spark is a general-purpose cluster computing system which provides APIs for several high-level programming languages, including Python. In addition, it supports higher-level tools including Spark SQL, MLlib, GraphX, and Spark Streaming.

## Installation

As with the Hadoop installation, we want to download, unpack, and remove the provided tar archive. Again, we create a symbolic link to the resulting directory, simply named spark.

In [None]:
%%bash
wget https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
tar xzf *.tgz
rm *.tgz
ln -s spark-2.2.0-bin-hadoop2.7 spark

## Configuration

First, check whether the spark-env.sh file exists in the spark/conf directory. If it does not, create it by copying the spark-env.sh.template file in spark/conf.

Append the contents of Spark-conf-snippet.txt to the spark-env.sh file.

In [None]:
%%bash
if [ ! -f spark/conf/spark-env.sh ]; then
    cp spark/conf/spark-env.sh.template spark/conf/spark-env.sh
fi
cat Spark-conf-snippet.txt >> spark/conf/spark-env.sh

Finally, append the contents of Spark-bashrc.txt to your .bashrc file.

In [None]:
%%bash
cat Spark-bashrc.txt >> ~/.bashrc

And you're done! You've installed Anaconda, Hadoop, and Spark.