# Hadoop

- Big Data refers to data sets that are so large or complex that traditional data processing application software is inadequate to deal with them.
- Data analysis requires massively parallel software running on several servers.
- **Volume, Variety, Velocity, Variability and Veracity** describe Big Data properties.

![https://github.com/veekaybee/data-lake-talk/](https://github.com/pnavaro/big-data/blob/master/notebooks/images/bigdata.png?raw=1)

![Hadoop Logo](https://github.com/pnavaro/big-data/blob/master/notebooks/images/hadoop.png?raw=1)

- Hadoop is a framework for running applications on large clusters.
- The Hadoop framework transparently provides applications with both reliability and data motion.
- Hadoop implements the computational paradigm named **Map/Reduce**, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. 
- It provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
- Both MapReduce and the **Hadoop Distributed File System** are designed so that node failures are automatically handled by the framework.
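
To make the Map/Reduce paradigm above concrete, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample fragments are illustrative assumptions. It only shows how the work is divided into independent fragments whose results are grouped by key and then combined.

```python
from collections import defaultdict


def map_phase(fragment):
    """Map: emit a (word, 1) pair for every word in one fragment of the input."""
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(fragments):
    # Each fragment is mapped independently of the others, which is what
    # lets a real framework run (or re-run) a fragment on any node.
    pairs = [pair for fragment in fragments for pair in map_phase(fragment)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


fragments = ["big data on hadoop", "hadoop stores big files"]
counts = word_count(fragments)
print(counts)
```

In this sketch every phase runs in one process; in Hadoop the map and reduce calls would be scheduled on different nodes of the cluster.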

## HDFS

* HDFS is a distributed file system.
* HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
* HDFS is suitable for applications that have large data sets. 
* HDFS provides interfaces to move applications closer to where the data is located. A computation is much more efficient when it runs near the data it operates on, especially when the data set is huge.
* HDFS consists of a single NameNode with a number of DataNodes which manage storage. 
* HDFS exposes a file system namespace and allows user data to be stored in files. 
    1. A file is split into blocks, which are stored on the DataNodes; the NameNode keeps track of which DataNodes hold each block.
    2. The [NameNode](http://svmass2.mass.uhb.fr:50070) executes operations like opening, closing, and renaming files and directories.
    3. The [Secondary NameNode](http://svmass2.mass.uhb.fr:50090/status.html) periodically merges the edit log of the **NameNode** into a checkpoint of the namespace image; it is not a standby NameNode.
    4. The **DataNodes** perform block creation, deletion, and replication upon instruction from the NameNode.
    5. The placement of replicas is optimized for data reliability, availability, and network bandwidth utilization.
    6. User data never flows through the NameNode.
* Files in HDFS are write-once and have strictly one writer at any time.
* The DataNode has no knowledge about HDFS files.
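
As a rough illustration of how a file maps onto blocks and replicas, the sketch below computes the block layout for a given file size. The block size of 128 MiB and the replication factor of 3 are assumptions (both are configurable per cluster), and `block_layout` is a hypothetical helper, not part of any Hadoop API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MiB
REPLICATION = 3                 # assumed replication factor


def block_layout(file_size, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, size of the last block, total replicas stored)."""
    if file_size == 0:
        return 0, 0, 0
    n_blocks = -(-file_size // block_size)  # integer ceiling division
    last_block = file_size - (n_blocks - 1) * block_size
    return n_blocks, last_block, n_blocks * replication


MIB = 1024 * 1024
n_blocks, last_block, replicas = block_layout(300 * MIB)
print(n_blocks, last_block // MIB, replicas)  # prints: 3 44 9
```

Under these assumptions a 300 MiB file occupies two full blocks plus one 44 MiB block, and the cluster stores nine block replicas in total.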

## Accessibility

All [HDFS commands](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html) are invoked by the `bin/hdfs` script:
```shell
hdfs [SHELL_OPTIONS] COMMAND [GENERIC_OPTIONS] [COMMAND_OPTIONS]
```
## Manage files and directories
```shell
hdfs dfs -ls -h -R                # Recursively list subdirectories with human-readable file sizes
hdfs dfs -cp <src> <dst>          # Copy files from source to destination
hdfs dfs -mv <src> <dst>          # Move files from source to destination
hdfs dfs -mkdir /foodir           # Create a directory named /foodir
hdfs dfs -rm -r /foodir           # Remove the directory /foodir and its contents (-rmr is deprecated)
hdfs dfs -cat /foodir/myfile.txt  # View the contents of a file named /foodir/myfile.txt
```
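
From a notebook it can be convenient to call these commands from Python rather than from a `%%bash` cell. The following is a minimal sketch using only the standard library; `hdfs_dfs_argv` and `hdfs_dfs` are hypothetical helper names, and the wrapper assumes the `hdfs` executable is on the `PATH`.

```python
import shutil
import subprocess


def hdfs_dfs_argv(*args):
    """Build the argument list for `hdfs dfs <args...>` without running it."""
    return ["hdfs", "dfs", *map(str, args)]


def hdfs_dfs(*args):
    """Run `hdfs dfs <args...>` and return its standard output as text.

    Returns None when no `hdfs` executable is found on the PATH.
    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    if shutil.which("hdfs") is None:
        return None
    result = subprocess.run(
        hdfs_dfs_argv(*args), capture_output=True, text=True, check=True
    )
    return result.stdout


print(hdfs_dfs_argv("-ls", "-h", "-R", "/foodir"))
```

On a machine with Hadoop configured, `hdfs_dfs("-ls", "/")` would return the listing as a string; passing the arguments as a list avoids shell quoting problems with paths that contain spaces.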

## Transfer between nodes

### put
```shell
hdfs dfs -put [-f] [-p] [-l] [-d] [ - | <localsrc> ... ] <dst>
```
Copies a single source, or multiple sources, from the local file system to the destination file system. With `-` as the source, the input is read from standard input.

Options:

    -p : Preserves access and modification times, ownership, and permissions.
    -f : Overwrites the destination if it already exists.

```shell
hdfs dfs -put localfile /user/hadoop/hadoopfile
hdfs dfs -put -f localfile1 localfile2 /user/hadoop/hadoopdir
```
Commands similar to `-put` and `-get`:
- `moveFromLocal` : like `-put`, but the local source is deleted after the copy.
- `copyFromLocal` : like `-put`, but the source is restricted to a local file reference.
- `copyToLocal` : like `-get`, but the destination is restricted to a local file reference.

![hdfs blocks](https://github.com/pnavaro/big-data/blob/master/notebooks/images/hdfs-fonctionnement.jpg?raw=1)

The NameNode is not in the data path. It only provides the map of where data is and where data should go in the cluster (file system metadata).
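
The separation between metadata and data can be sketched with a toy in-memory model. Everything here is an assumption made for illustration: the class and function names are hypothetical, blocks are 4 bytes long, placement is round-robin, and replication is left out. The point is only that the client obtains block locations from the NameNode and then exchanges the bytes directly with the DataNodes.

```python
class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where each one lives."""

    def __init__(self):
        self.block_map = {}  # path -> list of (block_id, datanode_name)
        self._next_id = 0

    def new_block_id(self):
        block_id = f"blk_{self._next_id}"
        self._next_id += 1
        return block_id

    def register(self, path, placements):
        self.block_map[path] = list(placements)

    def locate(self, path):
        return self.block_map[path]


class ToyDataNode:
    """Stores raw blocks by id; it does not know which file a block belongs to."""

    def __init__(self):
        self.blocks = {}


def write_file(namenode, datanodes, path, data, block_size=4):
    """Client-side write: cut the data into blocks and send each one to a DataNode."""
    names = sorted(datanodes)
    placements = []
    for index, start in enumerate(range(0, len(data), block_size)):
        block_id = namenode.new_block_id()
        target = names[index % len(names)]  # round-robin placement, a simplification
        datanodes[target].blocks[block_id] = data[start:start + block_size]
        placements.append((block_id, target))
    namenode.register(path, placements)  # only metadata reaches the NameNode


def read_file(namenode, datanodes, path):
    """Client-side read: ask the NameNode for locations, then fetch from DataNodes."""
    return b"".join(
        datanodes[node].blocks[block_id] for block_id, node in namenode.locate(path)
    )


namenode = ToyNameNode()
datanodes = {"dn1": ToyDataNode(), "dn2": ToyDataNode()}
write_file(namenode, datanodes, "/user/alice/hello.txt", b"hello hdfs!")
print(namenode.locate("/user/alice/hello.txt"))
print(read_file(namenode, datanodes, "/user/alice/hello.txt"))
```

The toy NameNode never holds any of the file's bytes, and each toy DataNode only sees opaque block ids.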

For the following hands-on exercises, you can switch to [JupyterLab](https://jupyterlab.readthedocs.io).

Go to the following address: http://localhost:9000/lab

- Check that your HDFS home directory, which is required to execute MapReduce jobs, exists:
```bash
hdfs dfs -ls /user/${USER}
```
- Type the following commands: 
```bash
hdfs dfs -ls
hdfs dfs -ls /
hdfs dfs -mkdir test
```
- Create a local file user.txt containing your name and the date:

In [5]:
%%bash
echo "Hamza Dehidi" > Hamza_Dehidi.txt
date >> Hamza_Dehidi.txt
cat Hamza_Dehidi.txt

Hamza Dehidi
Thu Apr 28 17:58:41 UTC 2022


Copy it to HDFS:
```bash
hdfs dfs -put user.txt
```

Check with:
```bash
hdfs dfs -ls -R 
hdfs dfs -cat user.txt 
hdfs dfs -tail user.txt 
```

In [6]:
%%bash
hdfs dfs -put Hamza_Dehidi.txt
hdfs dfs -ls -R /user/alice/

drwxr-xr-x   - alice supergroup          0 2022-04-28 14:39 /user/alice/.skein
drwx------   - alice supergroup          0 2022-04-27 08:21 /user/alice/.skein/application_1651047611236_0002
-rw-r--r--   1 alice supergroup       1013 2022-04-27 08:21 /user/alice/.skein/application_1651047611236_0002/.skein.crt
-rw-r--r--   1 alice supergroup       1704 2022-04-27 08:21 /user/alice/.skein/application_1651047611236_0002/.skein.pem
-rw-r--r--   1 alice supergroup       1460 2022-04-27 08:21 /user/alice/.skein/application_1651047611236_0002/.skein.proto
-rw-------   1 alice supergroup    7806621 2022-04-27 08:21 /user/alice/.skein/application_1651047611236_0002/skein.jar
drwx------   - alice supergroup          0 2022-04-27 17:34 /user/alice/.skein/application_1651080743810_0001
-rw-r--r--   1 alice supergroup       1013 2022-04-27 17:33 /user/alice/.skein/application_1651080743810_0001/.skein.crt
-rw-r--r--   1 alice supergroup       1708 2022-04-27 17:33 /user/alice/.skein/application_1651

In [7]:
%%bash
hdfs dfs -cat Hamza_Dehidi.txt

Hamza Dehidi
Thu Apr 28 17:58:41 UTC 2022


## Example commands and descriptions
Remove the file:
```bash
hdfs dfs -rm user.txt
```

Put it on HDFS again and move it to the `books` directory:
```bash
hdfs dfs -copyFromLocal user.txt
hdfs dfs -mkdir -p books  # make sure the target directory exists
hdfs dfs -mv user.txt books/user.txt
hdfs dfs -ls -R -h
```

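Copy `books/user.txt` to `books/hello.txt`, check the directory, file, and byte counts under your home directory, then remove `books/user.txt`: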
```bash
hdfs dfs -cp books/user.txt books/hello.txt
hdfs dfs -count -h /user/$USER
hdfs dfs -rm books/user.txt
```

## Hands-on practice:

1. Create a directory `files_name_surname` in HDFS.
2. List the contents of the root directory `/`.
3. Upload the file `today.txt` to HDFS.
```bash
date > today.txt
whoami >> today.txt
```
4. Display the contents of the file `today.txt`.
5. Copy the file `today.txt` to the `files_name_surname` directory.
6. Copy the file `jps.txt` from the local file system to HDFS.
```bash
jps > jps.txt
```
7. Move the file `jps.txt` to the `files` directory.
8. Remove the file `today.txt` from your home directory in HDFS.
9. Display the last few lines of `jps.txt`.
10. Create a pandas DataFrame from `data` and save it to your HDFS `files` folder (you may use the `!hdfs dfs` syntax).

In [8]:
%%bash
date > today.txt
whoami >> today.txt

In [5]:
%%bash
# WRITE YOUR HDFS COMMANDS HERE - Task 1 
# Cancelled Due to Permission Error

In [14]:
%%bash
# WRITE YOUR HDFS COMMANDS HERE - Task 2
hdfs dfs -ls /user/alice/input

Found 1 items
-rw-r--r--   1 alice supergroup         58 2022-04-27 14:15 /user/alice/input/name_surname.txt


In [22]:
%%bash
# WRITE YOUR HDFS COMMANDS HERE - Task 3
hdfs dfs -put today.txt

In [16]:
%%bash
# WRITE YOUR HDFS COMMANDS HERE - Task 4
hdfs dfs -cat today.txt

Thu Apr 28 08:16:21 UTC 2022


In [None]:
%%bash
# WRITE YOUR HDFS COMMANDS HERE - Task 5
# Cancelled Due to Permission Error

In [24]:
%%bash
# WRITE YOUR HDFS COMMANDS HERE - Task 6

echo 'jps' > jps.txt
hdfs dfs -copyFromLocal jps.txt

In [25]:
%%bash
hdfs dfs -rm -f /user/alice/files/jps.txt # remove jps.txt if it exists; -f suppresses the error when it does not

# WRITE YOUR HDFS COMMANDS HERE - Task 7
hdfs dfs -mv jps.txt /user/alice/files/jps.txt

Deleted /user/alice/files/jps.txt


In [26]:
%%bash 
hdfs dfs -ls /user/alice/files # check that jps.txt is in the correct directory

Found 7 items
-rw-r--r--   1 alice supergroup         46 2022-04-28 20:33 /user/alice/files/berna_bilcan_sen.txt
-rw-r--r--   1 alice supergroup         46 2022-05-01 14:54 /user/alice/files/can_mergen.csv
-rw-r--r--   1 alice supergroup       4349 2022-04-28 20:13 /user/alice/files/data.csv
-rw-r--r--   1 alice supergroup        574 2022-04-28 20:18 /user/alice/files/df.txt
-rw-r--r--   1 alice supergroup         11 2022-05-06 20:06 /user/alice/files/ersin_sonmez_today.txt
-rw-r--r--   1 alice supergroup          4 2022-05-08 13:07 /user/alice/files/jps.txt
-rw-r--r--   1 alice supergroup         35 2022-04-28 20:14 /user/alice/files/today.txt


In [23]:
%%bash
# WRITE YOUR HDFS COMMANDS HERE - Task 8
hdfs dfs -rm /user/alice/today.txt

Deleted /user/alice/today.txt


In [27]:
%%bash
# WRITE YOUR HDFS COMMANDS HERE - Task 9
hdfs dfs -tail /user/alice/files/jps.txt

jps


In [36]:
# WRITE YOUR PYTHON AND HDFS COMMANDS HERE - Task 10
import pandas as pd
data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data)

In [44]:
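# Write the DataFrame from Python: in a %%bash cell, `df` would run the disk-usage command instead
df.to_csv("df.txt", index=False)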

In [45]:
!hdfs dfs -put df.txt /user/alice/files/df.txt
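
As an alternative for Task 10, the DataFrame can be serialized in Python and streamed to HDFS through standard input, since `-put` accepts `-` as its source. This is a minimal sketch, not the expected solution: the target path `files/df.csv` is a hypothetical name relative to the HDFS home directory, and the upload is skipped when no `hdfs` executable is found.

```python
import shutil
import subprocess

import pandas as pd

data = {"col1": [1, 2], "col2": [3, 4]}
df = pd.DataFrame(data)

# Serialize in memory; no temporary local file is needed.
csv_text = df.to_csv(index=False)

target = "files/df.csv"  # hypothetical HDFS path, relative to the HDFS home directory
if shutil.which("hdfs") is not None:
    # `-put -` reads the data from standard input; `-f` overwrites an existing file.
    subprocess.run(
        ["hdfs", "dfs", "-put", "-f", "-", target],
        input=csv_text, text=True, check=True,
    )
else:
    print(csv_text)
```

Afterwards, `hdfs dfs -cat files/df.csv` should show the header line followed by the two data rows, provided the `files` directory already exists.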