# For those who cannot install Linux: set up a Linux VM with the following steps

A VM, or virtual machine, is a software simulation of a computer.

For this we're going to install VirtualBox. Windows users should download this file:
http://download.virtualbox.org/virtualbox/5.1.28/VirtualBox-5.1.28-117968-Win.exe

Mac users should install this:
http://download.virtualbox.org/virtualbox/5.1.28/VirtualBox-5.1.28-117968-OSX.dmg

Then, download the Ubuntu ISO:
https://www.ubuntu.com/download/desktop

Run VirtualBox. Click "New" and follow the prompts to create a VM.

For the name and operating system screen, pick Linux for type, and Ubuntu (64-bit) for Version (if your current OS is 32-bit, you should pick Ubuntu 32-bit).

Try to allocate a large amount of RAM to your virtual machine, since parallel computing is quite expensive computationally.

For the Hard disk screen, choose to create a new virtual hard disk now. Choose VirtualBox Disk Image for the Hard disk file type. Then, select dynamically allocated for your storage type. We recommend capping the hard disk at around 30 GB.

# Anaconda

Anaconda is a prepackaged Python ecosystem geared towards data science. Anaconda and its supporting products are supplied by Continuum Analytics. Anaconda comes pre-built with a wide variety of packages; a full list can be found [here](docs.continuum.io/anaconda/packages/pkg-docs).

## Installation

The installation of Anaconda is fairly simple. First, download the install package supplied by Continuum Analytics and give it execute privileges:

In [None]:
cd ~/Downloads
wget https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86_64.sh
chmod u+x Anaconda*.sh

Then, run the provided install script; this will walk you through the installation and configuration of Anaconda (for configuration, the defaults will usually be fine):

In [None]:
./Anaconda*.sh
rm Anaconda*.sh
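
If the installer refuses to start, the execute bit is the first thing to check. The following is a minimal sketch of that check; it uses a temporary stand-in file (an assumption made so the example runs anywhere) rather than the real installer.

```shell
# Stand-in for the downloaded installer; mktemp creates it without execute permission
installer="$(mktemp)"

# Grant execute permission to the owning user only
chmod u+x "$installer"

# [ -x FILE ] succeeds when the current user may execute FILE
if [ -x "$installer" ]; then
    status="executable"
else
    status="not executable"
fi
echo "installer is $status"

rm -f "$installer"
```

On the real file, `ls -l ~/Downloads/Anaconda*.sh` shows the same information as an `x` in the owner permission column.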

## Virtual Environments and Package Installation with Conda and Pip

Anaconda comes with not one, but two separate package managers: the traditional Python package manager, pip, and its own package manager, Conda. Both can be used to install additional Python packages from their respective repositories.
 
It is often quite useful to manage separate installations of Python and its packages, for example when different projects require different versions of the same package. This is accomplished through virtual environments: separate, selectable installations of Python and its packages. These can be created and managed using conda, or using pip and the virtualenv package.

The following cheatsheet should be a sufficient primer on how to use Conda and Pip for package installation and virtual environment management.

https://conda.io/docs/_downloads/conda-cheatsheet.pdf
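
As a rough sketch of a typical workflow, the function below creates an environment, installs one package with conda and one with pip, and then leaves the environment. The environment name `cds-demo` and the packages `numpy` and `requests` are placeholders chosen for illustration. `source activate` matches the conda version shipped with Anaconda 4.4.0; newer conda releases use `conda activate` instead.

```shell
# Hypothetical environment name, for illustration only
ENV_NAME="cds-demo"

setup_env() {
    # Create an isolated environment with its own Python interpreter
    conda create --yes --name "$ENV_NAME" python=3.6
    # Install a package from the conda repositories into that environment
    conda install --yes --name "$ENV_NAME" numpy
    # Activate the environment, then install a package from PyPI with pip
    source activate "$ENV_NAME"
    pip install requests
    # List all environments; the active one is marked with an asterisk
    conda env list
    # Return to the root environment
    source deactivate
}

# The function is only defined here. Once conda is on your PATH, run:
#   setup_env
```

Keeping the commands in a function means nothing is downloaded until you call it deliberately.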

# Where to Begin

Please, before we begin, open your console and run the command: 

In [None]:
sudo apt install ssh openssh-server

# Java

For Hadoop to work properly, you first need to install Java. To do so, run the following lines:

In [None]:
# First, we want to add the 
# repository that holds Oracle's Java
sudo add-apt-repository ppa:webupd8team/java
sudo apt update
sudo apt install oracle-java8-installer

Next, install the package that sets the default Java environment variables, and select oracle-java8 as your default Java version.

In [None]:
sudo apt install oracle-java8-set-default
sudo update-alternatives --config java
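
To confirm which Java the system will actually use, a small check along these lines may help. It is a sketch that only assumes `java` may or may not be on your PATH; with the Oracle package, the resolved path would be expected to sit under /usr/lib/jvm/java-8-oracle.

```shell
if command -v java >/dev/null 2>&1; then
    # readlink -f follows the update-alternatives symlink chain to the real binary
    java_path="$(readlink -f "$(command -v java)")"
    # java prints its version on stderr, so redirect it to stdout
    java -version 2>&1 | head -n 1
else
    java_path="not found"
fi
echo "java resolves to: $java_path"
```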

# Hadoop

Hadoop is a framework for distributed storage and processing which allows users to store large data sets across the machines of a cluster, while maintaining integrity in the face of failure. It includes HDFS, a distributed file system which provides access to application data, and YARN, a framework for job scheduling and resource management.

## Installation

Begin by creating a hadoop group and adding your existing user to it. Membership in a group grants your user the permissions set on files belonging to that group.

In [None]:
sudo addgroup hadoop 
sudo usermod -a -G hadoop $USER
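
A quick way to verify the membership is sketched below; `in_group` is a helper name made up for this example. Passing the user name to `id` reads the group database, so the change shows up immediately, whereas your current login session only picks up the new group after you log out and back in.

```shell
in_group() {
    # $1 = user name, $2 = group name
    # id -nG prints the user's groups on one line; split them and match exactly
    id -nG "$1" | tr ' ' '\n' | grep -qx "$2"
}

me="$(id -un)"
if in_group "$me" hadoop; then
    membership="yes"
else
    membership="no"
fi
echo "$me in hadoop group: $membership"
```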

Then, generate SSH keys to be used for authentication with the system. Copy the public key to your user on localhost.

In [None]:
ssh-keygen; ssh-copy-id $USER@localhost
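
Hadoop's start scripts use ssh to launch their daemons, so it is worth checking that the key works without a password prompt. This sketch assumes the SSH server from the earlier step is running; `BatchMode=yes` makes ssh fail instead of asking for a password.

```shell
# Run a no-op command over ssh; BatchMode forbids interactive prompts
if ssh -o BatchMode=yes -o ConnectTimeout=5 localhost true 2>/dev/null; then
    ssh_status="ok"
else
    ssh_status="failed"
fi
echo "passwordless ssh to localhost: $ssh_status"
```

If this reports `failed`, connect once interactively with `ssh localhost` to see the actual error message.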

Now that setup is done, it's time to actually install Hadoop. First, download Hadoop from the following page: http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz. Then, move it to the /usr/local directory.

This hadoop directory will have a rather cumbersome name. Create a symbolic link to it, simply named hadoop, in /usr/local.

In [None]:
cd /usr/local
sudo mv ~/Downloads/hadoop-2.8.1.tar.gz .
sudo tar xzf *.gz
sudo rm *.gz
sudo chown -R $USER:hadoop hadoop*
sudo ln -s hadoop* hadoop 
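
The symbolic link step can be tried safely in a scratch directory first. The sketch below uses a temporary directory as a stand-in for /usr/local and names the versioned directory explicitly, because a `hadoop*` glob would also match the link itself if the commands were run a second time.

```shell
# Scratch directory standing in for /usr/local
workdir="$(mktemp -d)"
mkdir "$workdir/hadoop-2.8.1"

# Create a link with a short, version-independent name
ln -s "$workdir/hadoop-2.8.1" "$workdir/hadoop"

# readlink prints the path the link points to
target="$(readlink "$workdir/hadoop")"
link_name="$(basename "$target")"
echo "hadoop -> $link_name"

rm -rf "$workdir"
```

On the real installation, `readlink /usr/local/hadoop` should print the versioned directory name.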

## Configuration

CDS supplies a number of small files containing configuration changes for your Hadoop and Spark setups. Download the Hadoop Configuration Files that were posted on Slack.

In [None]:
cd hadoop
mv ~/Downloads/CDS-Hadoop-config-files.tar.gz /usr/local/hadoop
tar xzf CDS-Hadoop-config-files.tar.gz
rm CDS-Hadoop-config-files.tar.gz

First of all, append the Hadoop-bashrc-snippet file to your .bashrc file. .bashrc is a configuration file that your bash shell reads at startup.

In [None]:
cat Hadoop-bashrc-snippet.txt >> ~/.bashrc
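
Running the append twice leaves duplicate lines in .bashrc. One way to guard against that is sketched below on temporary files; `append_once` is a helper name invented for this example, and the snippet content is hypothetical because the real file comes from the CDS archive.

```shell
rc_file="$(mktemp)"    # stand-in for ~/.bashrc
snippet="$(mktemp)"    # stand-in for Hadoop-bashrc-snippet.txt

# Hypothetical snippet content; the real file may differ
printf '%s\n' '# >>> hadoop settings >>>' 'export HADOOP_HOME=/usr/local/hadoop' > "$snippet"

append_once() {
    # $1 = snippet file, $2 = rc file
    # The first line of the snippet serves as a marker for "already appended"
    marker="$(head -n 1 "$1")"
    if ! grep -qxF -e "$marker" "$2"; then
        cat "$1" >> "$2"
    fi
}

append_once "$snippet" "$rc_file"
append_once "$snippet" "$rc_file"   # second call changes nothing

count="$(grep -c 'HADOOP_HOME' "$rc_file")"
echo "HADOOP_HOME lines: $count"

rm -f "$rc_file" "$snippet"
```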

Next, we need an actual place to store the Hadoop file system. We will create a hadoop directory under the system /var/lib directory for this purpose. Then, change the ownership of this directory to our user.

In [None]:
sudo mkdir /var/lib/hadoop
sudo chown -R $USER:hadoop /var/lib/hadoop

There are two small changes we want to make manually to the config files. First, set the HADOOP_OPTS variable so that Hadoop prefers IPv4, which effectively disables IPv6.
Second, replace the "${JAVA_HOME}" value on the JAVA_HOME line with the location of your chosen JDK directory.

In [None]:
sed -i 's/^export HADOOP_OPTS=.*$/export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true/' etc/hadoop/hadoop-env.sh
sed -i 's+^export JAVA_HOME=.*$+export JAVA_HOME=/usr/lib/jvm/java-8-oracle+' etc/hadoop/hadoop-env.sh
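
To see what the two substitutions do before touching the real file, they can be run on a scratch copy. The two input lines below are stand-ins for the corresponding lines of hadoop-env.sh, not its full contents.

```shell
demo="$(mktemp)"

# Stand-in for the two relevant lines of hadoop-env.sh
cat > "$demo" <<'EOF'
export JAVA_HOME=${JAVA_HOME}
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
EOF

# Replace each whole line that starts with the given export
sed -i 's/^export HADOOP_OPTS=.*$/export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true/' "$demo"
# The second command uses + as the delimiter because the replacement contains slashes
sed -i 's+^export JAVA_HOME=.*$+export JAVA_HOME=/usr/lib/jvm/java-8-oracle+' "$demo"

result="$(cat "$demo")"
echo "$result"

rm -f "$demo"
```

On the real file, `grep '^export' etc/hadoop/hadoop-env.sh` shows whether both lines were rewritten.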

Then, write the following snippet files over their respective XML configuration files.

In [None]:
cat Hadoop-core-snippet.txt > etc/hadoop/core-site.xml
cat Hadoop-hdfs-snippet.txt > etc/hadoop/hdfs-site.xml
cat Hadoop-yarn-snippet.txt > etc/hadoop/yarn-site.xml
rm *-snippet.txt

Please open a new terminal or reload your .bashrc script. Finally, we will format our HDFS directory and start the HDFS and YARN daemons. Check your results with the jps command.

In [None]:
source ~/.bashrc
hdfs namenode -format
start-dfs.sh
start-yarn.sh
jps

The output of this should be something like the following:

9648 Jps

8260 ResourceManager

8389 NodeManager

9147 DataNode

8989 NameNode

9342 SecondaryNameNode
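
Comparing the list by eye gets tedious after a few restarts. The sketch below reports which of the five daemons listed above are absent from a block of jps output; `missing_daemons` is a helper name invented for this example, and the sample text reuses the process IDs shown above.

```shell
missing_daemons() {
    # $1 = text in the format printed by jps ("<pid> <name>" per line)
    missing=""
    for name in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # Compare the second field exactly, so NameNode does not match SecondaryNameNode
        if ! printf '%s\n' "$1" | awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'; then
            missing="$missing $name"
        fi
    done
    # Strip the leading space before printing
    printf '%s\n' "${missing# }"
}

sample='9648 Jps
8260 ResourceManager
8389 NodeManager
9147 DataNode
8989 NameNode
9342 SecondaryNameNode'

all_running="$(missing_daemons "$sample")"
partial="$(missing_daemons '9648 Jps
8989 NameNode')"

echo "missing (full sample): '$all_running'"
echo "missing (partial sample): '$partial'"
```

On a live system, pass the real output with `missing_daemons "$(jps)"`; an empty result means every daemon is up.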

# Spark

Spark is a general-purpose cluster computing system which provides APIs for several high-level programming languages, including Python. In addition, it supports higher-level tools including Spark SQL, MLlib, GraphX, and Spark Streaming.

## Installation

As with the Hadoop installation, we want to download and unpack the provided tar archive, then remove it. Again, we create a symbolic link to the resulting directory, simply named spark.

In [None]:
cd /usr/local
sudo wget https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
sudo tar xzf *.tgz
sudo rm *.tgz
sudo chown -R $USER spark*
sudo ln -s spark* spark

## Configuration

Download the Spark Configuration Files posted on Slack. Then, check if the spark-env.sh file exists in the spark/conf directory. If it does not, create a copy of it from the spark-env.sh.template file in spark/conf.

Append the contents of Spark-conf-snippet.txt to the spark-env.sh file.

In [None]:
cd spark
sudo mv ~/Downloads/CDS-Spark-config-files.tar.gz /usr/local/spark
bash -c 'if [ ! -f conf/spark-env.sh ]; then cp conf/spark-env.sh.template conf/spark-env.sh; fi'
tar xzf CDS-Spark-config-files.tar.gz
rm CDS-Spark-config-files.tar.gz
cat Spark-conf-snippet.txt >> conf/spark-env.sh

Finally, append the contents of Spark-bashrc-snippet.txt to your .bashrc file.

In [None]:
cat Spark-bashrc-snippet.txt >> ~/.bashrc
rm *-snippet.txt

Test to see if everything is set up properly by running: 

In [None]:
source ~/.bashrc
pyspark --master yarn --deploy-mode client

You should check that:

A Jupyter notebook app pops up.

You can make a new Python 3 notebook, type sc into a cell, and run it. You should see information about your Spark session.

localhost:50070 opens in your browser and shows the HDFS NameNode web interface.

localhost:8088 opens in your browser and shows that there is an app running.
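
The two browser checks can also be approximated from a terminal. This sketch assumes curl is installed and only tests whether something answers HTTP on each port; it does not inspect the page contents, and `ui_status` is a helper name made up for the example.

```shell
ui_status() {
    # $1 = port; prints "up" if an HTTP server answers on localhost, otherwise "down"
    if curl --silent --fail --max-time 5 --output /dev/null "http://localhost:$1/" 2>/dev/null; then
        echo "up"
    else
        echo "down"
    fi
}

hdfs_ui="$(ui_status 50070)"   # port 50070 from the check above
yarn_ui="$(ui_status 8088)"    # port 8088 from the check above

echo "localhost:50070 is $hdfs_ui"
echo "localhost:8088 is $yarn_ui"
```

A `down` result usually means the corresponding daemon is not running; jps shows which ones are.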

And you're done! You've installed Anaconda, Hadoop, and Spark.