<div style="font-size:18pt; padding-top:20px; text-align:center"><b>Introduction to </b> <span style="font-weight:bold; color:green">HDFS</span></div><hr>
<div style="text-align:right;">Sergei Yu. Papulin <span style="font-style: italic;font-weight: bold;">(papulin_bmstu@mail.ru, papulin_hse@mail.ru)</span></div>

<a name="0"></a>
<div><span style="font-size:14pt; font-weight:bold">Content</span>
    <ol>
        <li><a href="#1">HDFS Shell Commands</a></li>
        <li><a href="#2">HDFS Java API</a>
            <ol style = "list-style-type:lower-alpha">
                <li><a href="#2a">Reading Files</a></li>
                <li><a href="#2b">Copying Files From Local To HDFS</a></li>
                <li><a href="#2c">Other manipulations</a></li>
                <li><a href="#2d">Running on Cloudera</a></li>
            </ol>
        </li>
        <li><a href="#3">HDFS on EMR Cluster</a></li>
        <li><a href="#4">References</a></li>
    </ol>
</div>

<p>Run the cell below to apply the Jupyter notebook style</p>

In [1]:
%%html
<link href="css/style.css" rel="stylesheet" type="text/css">

<a name="1"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">1. HDFS Shell Commands</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To content</a></div>
    </div>
</div>

<p><b>Basic Commands</b></p>

<div class="msg-block msg-info">
      <div class="msg-text-info">
          <p><a href="https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html">The File System (FS) shell</a></p>
     </div>
</div>

<p>Connect to your local Cloudera VM via SSH</p>

<p>Show HDFS topology</p>

In [None]:
hdfs dfsadmin -printTopology

<div class="msg-block msg-info">
      <div class="msg-text-info">
          <p>The cloudera user has all permissions</p>
     </div>
</div>

<p>List the contents of the root directory</p>

In [None]:
hdfs dfs -ls /

<p>Create a directory in HDFS as the hdfs user</p>

In [None]:
sudo -u hdfs hdfs dfs -mkdir -p /hdfs_data

<p>Grant all users full permissions on the directory in HDFS</p>

In [None]:
sudo -u hdfs hdfs dfs -chmod -R 777 /hdfs_data

<p>Create a subdirectory</p>

In [None]:
hdfs dfs -mkdir -p /hdfs_data/first_data

<p>Copy files from your local file system to the HDFS</p>

In [None]:
hdfs dfs -copyFromLocal -f /YOUR_PATH/* /hdfs_data/first_data

<p>Check that all data was correctly copied</p>

In [None]:
hdfs dfs -ls /hdfs_data/first_data

<p>Show a report about blocks of the file</p>

In [None]:
hdfs fsck /hdfs_data/first_data/YOUR_FILE -files -blocks -locations

<p>Display the last kilobyte of the file</p>

In [None]:
hdfs dfs -tail /hdfs_data/first_data/YOUR_FILE

<p><b>HDFS Dashboard</b></p>

<p>Use port forwarding from your local host to the local VM to access the HDFS dashboard</p>

In [None]:
sudo ssh -N -f -L 9961:quickstart.cloudera:50070 cloudera@127.0.0.1 -p 2222

<p>Open a web browser to view the HDFS dashboard</p>

<div class="code-block code-font"><a href="http://localhost:9961">http://localhost:9961</a></div>

<a name="2"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">2. HDFS Java API</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To content</a></div>
    </div>
</div>

<div class="msg-block msg-info">
      <div class="msg-text-info">
          <p>It's better to develop apps on a machine where a Hadoop cluster is running in one of the available modes. For example, you can install IntelliJ on the local Cloudera VM and develop the app there.
          </p>
     </div>
</div>

<div class="msg-block msg-info">
      <div class="msg-text-info">
          <p><a href="https://hadoop.apache.org/docs/current/api/overview-summary.html">Current Version of Hadoop API</a>
          </p>
     </div>
</div>

Check the version of your Hadoop installation

In [None]:
hdfs version

<p>Print the version of your JDK</p>

In [None]:
java -version

<div class="msg-block msg-warning">
  <div class="msg-text-warn"><p>The JDK version on the local Cloudera VM is 7, so you have to set this version for your projects in the IntelliJ IDE.</p></div>
</div>

<p><b>JDK 7 Installation</b></p>

<div class="msg-block msg-ref">
  <div class="msg-text-ref"><p>For more information, see the links below<br><a href="http://www.oracle.com/technetwork/java/javase/downloads/java-archive-downloads-javase7-521261.html">Java SE 7 Archive Downloads</a><br><a href="https://askubuntu.com/questions/920106/webupd8-oracle-java-7-installer-failing-with-404">webupd8 oracle-java-7-installer failing with 404</a></p></div>
</div>

<a name="2a"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            a. Reading Files
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#2">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#2b">Next</a>
            </div>
        </div>
    </div>
</div>

<p><b>Creating HDFS App jar files in IntelliJ IDE</b></p>

IntelliJ

In [None]:
https://www.jetbrains.com/help/idea/creating-running-and-packaging-your-first-java-application.html

<p>Create a new project</p>

In [None]:
File -> New -> Project -> Java (select appropriate SDK) -> Next -> Project Name: HDFSTest2 -> Finish

<p>Attach Hadoop dependencies through Maven</p>

In [None]:
File -> Project Structure -> Modules -> Dependencies -> Add -> Library -> From Maven ->

In [None]:
org.apache.hadoop:hadoop-client:2.6.0

<p>Create a new class</p>

In [None]:
Right Click in src -> New -> Java Class -> edu.classes.hdfs.MyReadFile

<div class="msg-block msg-info">
  <div class="msg-text-info"><p>For more information, see the links below<br><a href="https://mrchief2015.wordpress.com/2015/02/09/compiling-and-debugging-hadoop-applications-with-intellij-idea-for-windows/">HOW-TO: COMPILE AND DEBUG HADOOP APPLICATIONS WITH INTELLIJ IDEA IN WINDOWS OS (64BIT)</a><br><a href="https://blog.cloudera.com/blog/2014/06/how-to-create-an-intellij-idea-project-for-apache-hadoop/">How-to: Create an IntelliJ IDEA Project for Apache Hadoop</a></p></div>
</div>

<p>Add the code below to your class file</p>

In [None]:
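// Source: Hadoop: The Definitive Guide

package edu.classes.hdfs;

import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.InputStream;
import java.net.URI;

public class MyReadFile {

    public static void main(String[] args) throws Exception {

        // Path or URI of the file to read, passed as the first program argument
        String uri = args[0];

        // Loads the Hadoop configuration found on the classpath
        Configuration conf = new Configuration();

        // Resolves the file system that serves the given URI
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        InputStream in = null;

        try {
            in = fs.open(new Path(uri));
            // Copy the file to standard output through a 4096-byte buffer; keep System.out open
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            // Close the input stream even if the copy fails
            IOUtils.closeStream(in);
        }
    }
}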
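
<p>The <span class="code-font">IOUtils.copyBytes(in, System.out, 4096, false)</span> call above streams the file through a 4096-byte buffer. The sketch below shows what such a buffered copy loop looks like using only the JDK. It is an illustration, not Hadoop's actual implementation, and the class name <span class="code-font">CopyLoopSketch</span> and its <span class="code-font">copy</span> method are hypothetical.</p>

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the Hadoop API.
class CopyLoopSketch {

    // Copies all bytes from in to out through a fixed-size buffer
    // and returns the number of bytes copied. Neither stream is closed.
    static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int read;
        // read() returns -1 once the end of the stream is reached
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In-memory streams stand in for the HDFS input stream and System.out
        byte[] data = "hello hdfs".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), sink, 4096);
        System.out.println(copied + " bytes copied");
    }
}
```

<p>Leaving the streams open mirrors the <span class="code-font">false</span> argument in the call above, which is why the original code closes the input stream itself in the <span class="code-font">finally</span> block.</p>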

<p>Edit configurations and run the project</p>

In [None]:
Run -> Edit configurations -> Add -> Main class: edu.classes.hdfs.MyReadFile; Program arguments: /input -> Apply

In [None]:
Run -> Run 'configuration name'

<p>Build Jar File</p>

In [None]:
File -> Project Structure -> Artifacts -> Add -> Jar -> From modules... -> Main Class: edu.classes.hdfs.MyReadFile -> Apply -> Ok

In [None]:
Build -> Build Artifacts

<div class="msg-block msg-ref">
  <div class="msg-text-ref"><p>For more information, see the link below<br><a href="http://www.lifeincode.net/programming/hadoop-building-the-jar-of-wordcount-in-intellij-idea/">Building the Jar of wordcount in IntelliJ IDEA</a></p></div>
</div>

<a name="2b"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            b. Copying Files From Local To HDFS 
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#2a">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#2c">Next</a>
            </div>
        </div>
    </div>
</div>

In [None]:
package edu.classes.hdfs;

import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.Progressable;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;


public class MyFileCopy {

    public static void main(String[] args) throws Exception {

        // First argument: source file on the local file system
        String localPath = args[0];
        // Second argument: destination path or URI in HDFS
        String hdfsPath = args[1];

        InputStream in = new BufferedInputStream(new FileInputStream(localPath));

        Configuration conf = new Configuration();
        // Block size for the new file: 16 MB ("dfs.block.size" is the older, deprecated name of this key)
        conf.setInt("dfs.blocksize", 16777216);

        FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);

        // Alternative: fs.copyFromLocalFile(...) copies the file in a single call

        // Create the destination file; progress() is called back as data is written
        OutputStream out = fs.create(new Path(hdfsPath), new Progressable() {
            @Override
            public void progress() {
                System.out.print(".");
            }
        });

        // Copy through a 4096-byte buffer; "true" closes both streams when done
        IOUtils.copyBytes(in, out, 4096, true);

    }
}
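
<p>The block size set in the code above (16 MB) determines how many blocks HDFS splits the uploaded file into: every block is full except possibly the last one. The sketch below works through that arithmetic for a hypothetical 40 MB file. The class and method names are illustrative and not part of the Hadoop API.</p>

```java
// Hypothetical helper for illustration only; not part of the Hadoop API.
class BlockCountSketch {

    // Number of blocks needed to store fileSize bytes (ceiling division).
    static long blockCount(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        return (fileSize + blockSize - 1) / blockSize;
    }

    // Size in bytes of the last block; 0 for an empty file.
    static long lastBlockSize(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        if (fileSize == 0) {
            return 0;
        }
        long remainder = fileSize % blockSize;
        return remainder == 0 ? blockSize : remainder;
    }

    public static void main(String[] args) {
        long blockSize = 16L * 1024 * 1024; // 16777216 bytes, as in the code above
        long fileSize = 40L * 1024 * 1024;  // assumed example file size
        System.out.println("blocks: " + blockCount(fileSize, blockSize));
        System.out.println("last block bytes: " + lastBlockSize(fileSize, blockSize));
    }
}
```

<p>For these assumed sizes the file occupies three blocks, the last holding 8388608 bytes. The <span class="code-font">hdfs fsck ... -files -blocks -locations</span> report lets you compare this with the blocks actually stored for a file.</p>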

<a name="2c"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            c. Other manipulations
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#2b">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#2d">Next</a>
            </div>
        </div>
    </div>
</div>

<div class="msg-block msg-info">
  <div class="msg-text-info"><p>For more information use link below<br><a href="https://github.com/saagie/example-java-read-and-write-from-hdfs/blob/master/src/main/java/io/saagie/example/hdfs/Main.java">example-java-read-and-write-from-hdfs</a><br><a href="https://tutorials.techmytalk.com/2014/08/16/hadoop-hdfs-java-api/">Hadoop HDFS JAVA API</a></p></div>
</div>

<p>Get/set property values in configuration files</p>

<div class="msg-block msg-info">
  <div class="msg-text-info"><p>For the default configuration of an HDFS node, see <a href="https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml">here</a></p></div>
</div>

In [None]:
# TODO
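
<p>Hadoop configuration files such as <span class="code-font">hdfs-site.xml</span> are XML documents made of <span class="code-font">property</span> elements, each with a <span class="code-font">name</span> and a <span class="code-font">value</span>. In a Hadoop application you would normally read and override them with <span class="code-font">Configuration.get(name)</span> and <span class="code-font">Configuration.set(name, value)</span>. The JDK-only sketch below merely illustrates the file format and the idea of an in-memory override. The class name, the sample XML and its values are assumptions made for this example.</p>

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of the Hadoop API.
class ConfigXmlSketch {

    // Parses <property><name>..</name><value>..</value></property> entries into a map.
    static Map<String, String> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList properties = doc.getElementsByTagName("property");
            Map<String, String> result = new LinkedHashMap<String, String>();
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                String name = property.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = property.getElementsByTagName("value").item(0).getTextContent().trim();
                result.put(name, value);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Malformed configuration XML", e);
        }
    }

    public static void main(String[] args) {
        // Sample content assumed for this example
        String xml = "<configuration>"
                + "<property><name>dfs.replication</name><value>3</value></property>"
                + "<property><name>dfs.blocksize</name><value>134217728</value></property>"
                + "</configuration>";
        Map<String, String> conf = parse(xml);

        // Get a property value
        System.out.println("dfs.replication = " + conf.get("dfs.replication"));

        // Set a property value: only the in-memory copy changes, not the XML source
        conf.put("dfs.blocksize", "16777216");
        System.out.println("dfs.blocksize = " + conf.get("dfs.blocksize"));
    }
}
```

<p>The override at the end plays the same role as the <span class="code-font">conf.setInt(...)</span> call in the file copy example: it affects only the running program.</p>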

<a name="2d"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            d. Running on Cloudera
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#2c">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#3">Next</a>
            </div>
        </div>
    </div>
</div>

<p>Connect to the local Cloudera VM via SSH and create the <span class="code-font">/home/cloudera/classes/hdfs</span> directory</p>

<p>Copy data to your local Cloudera VM</p>

In [None]:
sudo scp -P 2222 /YOUR_PATH/Input/data.txt cloudera@127.0.0.1:/home/cloudera/classes/hdfs

<p>Copy the jar files</p>

In [None]:
sudo scp -P 2222 /YOUR_PATH/out/artifacts/CopyFile_jar/CopyFile.jar cloudera@127.0.0.1:/home/cloudera/classes/hdfs

In [None]:
sudo scp -P 2222 /YOUR_PATH/out/artifacts/HDFSTest2_jar/ReadFile.jar cloudera@127.0.0.1:/home/cloudera/classes/hdfs

<p>Connect to the local Cloudera VM via SSH</p>

<p>Run the jar file to copy the local data to the HDFS</p> 

In [None]:
hadoop jar /home/cloudera/classes/hdfs/CopyFile.jar \
            /home/cloudera/classes/hdfs/data.txt \
            hdfs:///hdfs_data/output/data.txt

<p>Verify that the file is in the HDFS</p>

In [None]:
hdfs dfs -ls /hdfs_data/output

<p>Print the last 1 KB of the file</p>

In [None]:
hdfs dfs -tail /hdfs_data/output/data.txt

<p>Run the jar file to read data from HDFS</p>

In [None]:
hadoop jar /home/cloudera/classes/hdfs/ReadFile.jar \
            hdfs:///hdfs_data/output/data.txt
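
<p>Both applications pass their path argument through <span class="code-font">URI.create(...)</span> before calling <span class="code-font">FileSystem.get</span>, which uses the URI's scheme and authority to pick a file system and falls back to the configured default when they are missing. The sketch below only shows how the JDK splits such strings into parts. The class name is hypothetical, and port <span class="code-font">8020</span> in the second example is an assumption, not a value taken from this notebook.</p>

```java
import java.net.URI;

// Hypothetical helper for illustration only; not part of the Hadoop API.
class UriPartsSketch {

    // Returns the scheme, authority and path of a location string.
    static String describe(String location) {
        URI uri = URI.create(location);
        return "scheme=" + uri.getScheme()
                + ", authority=" + uri.getAuthority()
                + ", path=" + uri.getPath();
    }

    public static void main(String[] args) {
        // Scheme given, no host: the NameNode address has to come from the configuration
        System.out.println(describe("hdfs:///hdfs_data/output/data.txt"));
        // Scheme and host:port given explicitly (port assumed for this example)
        System.out.println(describe("hdfs://quickstart.cloudera:8020/hdfs_data/output/data.txt"));
        // Plain path: neither scheme nor authority
        System.out.println(describe("/input"));
    }
}
```

<p>This is why a bare path such as <span class="code-font">/input</span> and a URI such as <span class="code-font">hdfs:///hdfs_data/output/data.txt</span> can both be used as program arguments.</p>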

<p>Delete the output directory</p>

In [None]:
hdfs dfs -rm -r /hdfs_data/output

<a name="3"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">3. HDFS on EMR Cluster</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To content</a></div>
    </div>
</div>

<div class="msg-block msg-task">
  <div class="msg-text-task"><p>Deploy an EMR cluster with 3 instances and run the apps from the previous section</p></div>
</div>

<p>1. Check the JDK and Hadoop versions on the EMR Master</p>

<p>2. Create an IntelliJ Project with an appropriate JDK and Hadoop version</p>

<p>3. Create jar files for the ReadFile and CopyFile classes</p>

<p>4. Copy jar files to the EMR Master</p>

<p>5. Download the data archive to the local file system of the EMR Master using <span class="code-font">wget</span>, and extract the data</p>

<p>Reference to dataset: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz</p>

<div class="msg-block msg-info">
  <div class="msg-text-info"><p>For more information about this dataset click on the link below<br><a href="http://jmcauley.ucsd.edu/data/amazon/">Amazon product data</a></p></div>
</div>

<p>6. Run the jar file for copying the extracted data to the HDFS</p>

<p>7. Check that the file was successfully uploaded to the HDFS using shell commands</p>

<p>8. Display location of file blocks in the cluster</p>

<a name="4"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">4. References</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To content</a></div>
    </div>
</div>