# Hadoop Distributed File System (HDFS)

- The Hadoop Distributed File System (HDFS) is a Java-based distributed, scalable, and portable filesystem designed to span large clusters of commodity servers. The design of HDFS is based on GFS, the Google File System.
- HDFS is designed to store very large amounts of data, from gigabytes and terabytes up to petabytes. This is accomplished by using a block-structured filesystem: individual files are split into fixed-size blocks that are stored on machines across the cluster. Files made of several blocks generally do not have all of their blocks stored on a single machine.
- A Python client library is introduced that enables HDFS to be accessed programmatically from within Python applications.
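
The block-structured layout can be illustrated with a little arithmetic. The following is a minimal sketch, not HDFS code: `split_into_blocks` is a hypothetical helper, and the 128 MiB block size is an assumption (it is the default in Hadoop 2.x and later, and is configurable per cluster).

```python
# Assumed block size: 128 MiB (Hadoop 2.x default; configurable per cluster).
BLOCK_SIZE = 128 * 1024 * 1024


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the (offset, length) of each block for a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # Every block is full-size except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# A hypothetical 300 MiB file: two full blocks plus one partial block.
blocks = split_into_blocks(300 * 1024 * 1024)
for index, (offset, length) in enumerate(blocks):
    print(index, offset, length)
```

Only the final block is smaller than the block size; a file shorter than one block occupies a single partial block.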

## Overview of HDFS
- The __NameNode__ holds the metadata for the filesystem.
  - It is the most important machine in HDFS. It stores metadata for the entire filesystem: filenames, file permissions, and the location of each block of each file.
- __DataNode__ processes store the blocks that make up the files.
- The NameNode and DataNode processes can run on a single machine.
- HDFS clusters commonly consist of a dedicated server running the NameNode process and possibly thousands of machines running the DataNode process.
- The example in Figure 1-1 illustrates the mapping of files to blocks in the NameNode, and the storage of blocks and their replicas within the DataNodes.
<img src='images/f1.1.png'>
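
The two mappings held by the NameNode (file to blocks, block to DataNodes) can be modeled with plain dictionaries. This is a toy sketch under stated assumptions: the block ids, DataNode names, and `place_blocks` helper are hypothetical, the replication factor of 3 is an assumed value, and the round-robin placement stands in for HDFS's real placement policy, which is more involved.

```python
def place_blocks(block_ids, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Toy placement only; it is not the policy HDFS itself uses.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    placement = {}
    for start, block_id in enumerate(block_ids):
        # Consecutive nodes (wrapping around) are always distinct here.
        placement[block_id] = [
            datanodes[(start + i) % len(datanodes)] for i in range(replication)
        ]
    return placement


# NameNode-side metadata: filename -> ordered block ids.
file_to_blocks = {"/user/hduser/input.txt": ["blk_1", "blk_2"]}

# NameNode-side metadata: block id -> DataNodes holding a replica.
block_to_datanodes = place_blocks(
    file_to_blocks["/user/hduser/input.txt"], ["dn1", "dn2", "dn3", "dn4"]
)

for block_id, nodes in block_to_datanodes.items():
    print(block_id, nodes)
```

Reading a file then amounts to looking up its block list and fetching each block from any DataNode listed for it.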

## Interacting with HDFS

HDFS is accessed from the command line through the `hdfs` script, which has the following usage:

`$ hdfs COMMAND [-option <arg>]`

The COMMAND argument selects which functionality of HDFS will be used. The -option argument is the name of a specific option for the specified command, and `<arg>` is one or more arguments that are specified for this option.
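
The same command structure can be driven from Python with the standard `subprocess` module. This is a minimal sketch, assuming the `hdfs` script is on the `PATH`; `hdfs_command` and `run_hdfs` are hypothetical helper names, not part of any Hadoop library.

```python
import shutil
import subprocess


def hdfs_command(command, *args):
    """Build the argument list for `hdfs COMMAND [-option <arg>]`."""
    return ["hdfs", command, *args]


def run_hdfs(command, *args):
    """Run the command and return its stdout; needs `hdfs` on PATH."""
    if shutil.which("hdfs") is None:
        raise RuntimeError("hdfs executable not found on PATH")
    result = subprocess.run(
        hdfs_command(command, *args),
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


# Building the argument list does not require a Hadoop installation.
argv = hdfs_command("dfs", "-ls", "/")
print(argv)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.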

## Common File Operations

List Directory Contents<br>
To list the contents of a directory in HDFS, use the -ls command:<br>
```
$ hdfs dfs -ls
```

Providing -ls with the forward slash (/) as an argument displays
the contents of the root of HDFS:
```
$ hdfs dfs -ls /
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2015-09-20 14:36 /hadoop
drwx------ - hadoop supergroup 0 2015-09-20 14:36 /tmp
```
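
When a listing like this is captured by a script, each row can be split into named fields. The sketch below assumes the eight-column layout shown in the listing (permissions, replication, owner, group, size, date, time, path); `Entry` and `parse_ls` are hypothetical names, and other Hadoop versions may format the output differently.

```python
from collections import namedtuple

Entry = namedtuple(
    "Entry", "permissions replication owner group size date time path"
)


def parse_ls(output):
    """Parse the text printed by `hdfs dfs -ls` into Entry tuples."""
    entries = []
    for line in output.splitlines():
        # Split into at most 8 fields so a path containing spaces stays whole.
        fields = line.split(None, 7)
        if len(fields) != 8:
            continue  # skips the "Found N items" header and blank lines
        perms, repl, owner, group, size, date, time, path = fields
        entries.append(
            Entry(perms, repl, owner, group, int(size), date, time, path)
        )
    return entries


sample = """Found 2 items
drwxr-xr-x - hadoop supergroup 0 2015-09-20 14:36 /hadoop
drwx------ - hadoop supergroup 0 2015-09-20 14:36 /tmp"""

entries = parse_ls(sample)
for entry in entries:
    kind = "dir" if entry.permissions.startswith("d") else "file"
    print(kind, entry.owner, entry.path)
```

The replication column is kept as a string because directories show `-` there instead of a number.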

## Creating a Directory
To create the /user directory within HDFS, use the -mkdir command:<br>
`$ hdfs dfs -mkdir /user`

To make a home directory for the current user, hduser, use the
-mkdir command again:<br>
`$ hdfs dfs -mkdir /user/hduser`<br>
Use the -ls command to verify that the previous directories were
created:

```
$ hdfs dfs -ls -R /user
drwxr-xr-x - hduser supergroup 0 2015-09-22 18:01 /user/hduser
```

## Copy Data onto HDFS

Data can be uploaded to the user's HDFS home directory with the -put command:<br>
```
$ hdfs dfs -put /home/hduser/input.txt /user/hduser
```

Use the -ls command to verify that input.txt was copied to HDFS:

```
$ hdfs dfs -ls
Found 1 items
-rw-r--r-- 1 hduser supergroup 52 2015-09-20 13:20 input.txt
```

## Retrieving Data from HDFS

The following command uses -cat to display the contents of /user/hduser/input.txt:

```
$ hdfs dfs -cat input.txt
jack be nimble
jack be quick
jack jumped over the candlestick
```
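
The command names only `input.txt`, yet it reads /user/hduser/input.txt, because a relative HDFS path is resolved against the user's home directory. The rule can be sketched as follows; `resolve_hdfs_path` is a hypothetical helper, and the `/user/<username>` home location is assumed from the directory layout used in these notes.

```python
import posixpath


def resolve_hdfs_path(path, user):
    """Resolve a path against the assumed HDFS home directory /user/<user>."""
    home = posixpath.join("/user", user)
    if not path:
        return home  # no argument: the home directory itself
    # posixpath.join discards `home` when `path` is already absolute.
    return posixpath.normpath(posixpath.join(home, path))


print(resolve_hdfs_path("input.txt", "hduser"))
print(resolve_hdfs_path("/tmp", "hduser"))
print(resolve_hdfs_path("", "hduser"))
```

`posixpath` is used instead of `os.path` so that the result uses forward slashes regardless of the local operating system.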

Data can also be copied from HDFS to the local filesystem using the
-get command. The -get command is the opposite of the -put
command:
```
$ hdfs dfs -get input.txt /home/hduser
```
This command copies input.txt from /user/hduser on HDFS
to /home/hduser on the local filesystem.