## Copying files from local to HDFS

We can copy files from the local file system to HDFS using either the `copyFromLocal` or the `put` command.

* `hdfs dfs -copyFromLocal` or `hdfs dfs -put` copies files or directories from the local file system into HDFS. We can also use `hadoop fs` in place of `hdfs dfs`.
* However, we cannot edit the contents of a file in place once it is in HDFS. To fix any data, we have to copy the file to the local file system, fix the data there, and then copy the file back to HDFS.
* Files are divided into blocks based on the block size, and the blocks are stored on DataNodes in a distributed fashion according to the replication factor. We will get into the details later.
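
To make the last bullet concrete, here is a small arithmetic sketch. The 128 MB block size and the replication factor of 3 are assumed values (common Hadoop defaults, not read from this cluster; the listings in this section show a replication factor of 1). The file size is that of `create_db.sql` as it appears in the listings.

```shell
# Assumed values: adjust them to match dfs.blocksize and dfs.replication on your cluster.
block_size=$((128 * 1024 * 1024))   # 128 MB in bytes
replication=3

file_size=10303297                  # size of create_db.sql in bytes

# Ceiling division: the file needs one block per started block_size chunk.
blocks=$(( (file_size + block_size - 1) / block_size ))

# Every block is stored once per replica.
block_replicas=$(( blocks * replication ))

# Raw bytes used across DataNodes (a partly filled block only uses the bytes it holds).
raw_bytes=$(( file_size * replication ))

echo "blocks=$blocks block_replicas=$block_replicas raw_bytes=$raw_bytes"
```

With these assumptions the file fits in a single block, which is then stored three times. The values configured on a cluster can be read with `hdfs getconf -confKey dfs.blocksize` and `hdfs getconf -confKey dfs.replication`.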

![Anatomy of a file write in HDFS](https://s3.amazonaws.com/kaizen.itversity.com/hadoop-overview/04HDFSAnatomyOfFileWrite.png)

In [1]:
%%sh

hdfs dfs -ls /user/`whoami`

Found 1 items
drwxr-xr-x   - itversity supergroup          0 2022-10-31 03:04 /user/itversity/.sparkStaging


In [1]:
%%sh

hdfs dfs -ls -R /user/`whoami`

drwxr-xr-x   - itversity supergroup          0 2022-11-22 20:49 /user/itversity/.sparkStaging
drwx------   - itversity supergroup          0 2022-11-22 20:49 /user/itversity/.sparkStaging/application_1669145879076_0002
-rw-r--r--   1 itversity supergroup     237853 2022-11-22 20:49 /user/itversity/.sparkStaging/application_1669145879076_0002/__spark_conf__.zip
-rw-r--r--   1 itversity supergroup      42437 2022-11-22 20:49 /user/itversity/.sparkStaging/application_1669145879076_0002/py4j-0.10.7-src.zip
-rw-r--r--   1 itversity supergroup     593755 2022-11-22 20:49 /user/itversity/.sparkStaging/application_1669145879076_0002/pyspark.zip
drwxr-xr-x   - itversity supergroup          0 2022-11-07 03:11 /user/itversity/retail_db
drwxr-xr-x   - itversity supergroup          0 2022-11-07 03:11 /user/itversity/retail_db/categories
-rw-r--r--   1 itversity supergroup       1029 2022-11-07 03:11 /user/itversity/retail_db/categories/part-00000
-rw-r--r--   1 itversity supergroup   10303297 2022-

In [3]:
%%sh

hdfs dfs -mkdir /user/`whoami`/retail_db

In [4]:
%%sh

hdfs dfs -ls /user/`whoami`

Found 2 items
drwxr-xr-x   - itversity supergroup          0 2022-10-31 03:04 /user/itversity/.sparkStaging
drwxr-xr-x   - itversity supergroup          0 2022-11-07 01:49 /user/itversity/retail_db


In [5]:
%%sh

hdfs dfs -ls /user/`whoami`/retail_db

In [6]:
%%sh

hdfs dfs -help put

-put [-f] [-p] [-l] [-d] <localsrc> ... <dst> :
  Copy files from the local file system into fs. Copying fails if the file already
  exists, unless the -f flag is given.
  Flags:
                                                                       
  -p  Preserves access and modification times, ownership and the mode. 
  -f  Overwrites the destination if it already exists.                 
  -l  Allow DataNode to lazily persist the file to disk. Forces        
         replication factor of 1. This flag will result in reduced
         durability. Use with care.
                                                        
  -d  Skip creation of temporary file(<dst>._COPYING_). 


In [7]:
%%sh

hdfs dfs -help copyFromLocal

-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst> :
  Copy files from the local file system into fs. Copying fails if the file already
  exists, unless the -f flag is given.
  Flags:
                                                                                 
  -p                 Preserves access and modification times, ownership and the  
                     mode.                                                       
  -f                 Overwrites the destination if it already exists.            
  -t <thread count>  Number of threads to be used, default is 1.                 
  -l                 Allow DataNode to lazily persist the file to disk. Forces   
                     replication factor of 1. This flag will result in reduced   
                     durability. Use with care.                                  
  -d                 Skip creation of temporary file(<dst>._COPYING_).           
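
In the help output above, `copyFromLocal` lists a `-t <thread count>` flag that `put` does not. The following is a minimal sketch of how it might be combined with `-f`, assuming a Hadoop version whose `copyFromLocal` supports `-t` as shown. `copy_parallel` and `THREADS` are hypothetical names introduced here, and the example call is skipped when the `hdfs` CLI is not on the `PATH`.

```shell
# THREADS is an assumed starting value, not a recommendation from this notebook.
THREADS=4

# copy_parallel is a hypothetical wrapper name introduced for this sketch.
# Usage: copy_parallel <localsrc> ... <dst>
copy_parallel() {
  # -f overwrites existing files; -t sets the number of threads used for the copy.
  hdfs dfs -copyFromLocal -f -t "$THREADS" "$@"
}

if command -v hdfs >/dev/null 2>&1; then
  copy_parallel /data/retail_db/* "/user/$(whoami)/retail_db"
else
  echo "hdfs CLI not found; nothing copied"
fi
```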


```{warning}
The command `hdfs dfs -put /data/retail_db /user/${USER}/retail_db` copies the entire folder into the existing `/user/${USER}/retail_db`, so you will see `/user/${USER}/retail_db/retail_db`. The commands that follow it show how to get the files laid out as expected.
```
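
The nesting described in the warning appears to follow the same convention as the local `cp -r` command: when the destination directory already exists, the source directory is placed inside it; when it does not exist, it is created with the source's contents. The sketch below reproduces that rule on the local file system only (no HDFS involved), using a hypothetical temporary directory as a stand-in for the real paths.

```shell
# Local-only illustration with cp -r; all directory names are hypothetical stand-ins.
work=$(mktemp -d)
mkdir -p "$work/src/retail_db/orders"
echo "sample" > "$work/src/retail_db/orders/part-00000"

# Case 1: the destination already exists, so the source folder lands inside it.
mkdir -p "$work/existing/retail_db"
cp -r "$work/src/retail_db" "$work/existing/retail_db"
nested=$(ls "$work/existing/retail_db")
echo "existing destination contains: $nested"

# Case 2: the destination does not exist, so it is created with the source's contents.
mkdir -p "$work/fresh"
cp -r "$work/src/retail_db" "$work/fresh/retail_db"
flat=$(ls "$work/fresh/retail_db")
echo "new destination contains: $flat"

rm -rf "$work"
```

The first case yields a `retail_db` folder inside `retail_db`, which is the layout the warning describes; the second yields the `orders` folder directly.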

In [8]:
%%sh

ls -ltr /data/retail_db

total 20152
drwxrwxr-x 2 itversity itversity     4096 Oct 30 20:38 categories
drwxrwxr-x 2 itversity itversity     4096 Oct 30 20:38 customers
-rw-rw-r-- 1 itversity itversity     1748 Oct 30 20:38 create_db_tables_pg.sql
-rw-rw-r-- 1 itversity itversity 10303297 Oct 30 20:38 create_db.sql
drwxrwxr-x 2 itversity itversity     4096 Oct 30 20:38 departments
drwxrwxr-x 2 itversity itversity     4096 Oct 30 20:38 order_items
-rw-rw-r-- 1 itversity itversity 10297372 Oct 30 20:38 load_db_tables_pg.sql
drwxrwxr-x 2 itversity itversity     4096 Oct 30 20:38 orders
drwxrwxr-x 2 itversity itversity     4096 Oct 30 20:38 products


In [9]:
%%sh

hdfs dfs -put /data/retail_db /user/`whoami`/retail_db

In [10]:
%%sh

hdfs dfs -ls /user/`whoami`/retail_db

Found 1 items
drwxr-xr-x   - itversity supergroup          0 2022-11-07 01:52 /user/itversity/retail_db/retail_db


In [11]:
%%sh

hdfs dfs -ls /user/`whoami`/retail_db/retail_db

Found 9 items
drwxr-xr-x   - itversity supergroup          0 2022-11-07 01:51 /user/itversity/retail_db/retail_db/categories
-rw-r--r--   1 itversity supergroup   10303297 2022-11-07 01:51 /user/itversity/retail_db/retail_db/create_db.sql
-rw-r--r--   1 itversity supergroup       1748 2022-11-07 01:52 /user/itversity/retail_db/retail_db/create_db_tables_pg.sql
drwxr-xr-x   - itversity supergroup          0 2022-11-07 01:51 /user/itversity/retail_db/retail_db/customers
drwxr-xr-x   - itversity supergroup          0 2022-11-07 01:51 /user/itversity/retail_db/retail_db/departments
-rw-r--r--   1 itversity supergroup   10297372 2022-11-07 01:52 /user/itversity/retail_db/retail_db/load_db_tables_pg.sql
drwxr-xr-x   - itversity supergroup          0 2022-11-07 01:51 /user/itversity/retail_db/retail_db/order_items
drwxr-xr-x   - itversity supergroup          0 2022-11-07 01:51 /user/itversity/retail_db/retail_db/orders
drwxr-xr-x   - itversity supergroup          0 2022-11-07 01:52 /user/itve

```{note}
Let's drop the nested folder and copy the files again so that they land where we expect. Because the target folder already exists, we can use patterns (such as `order*` or `*`) to copy the sub folders and files into it.
```
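
Patterns like these are expanded by the local shell before `hdfs dfs -put` runs, so the command receives one `<localsrc>` argument per match. A minimal local sketch, using a hypothetical temporary directory in place of `/data/retail_db`:

```shell
# Hypothetical stand-in for /data/retail_db, created only for this demonstration.
src=$(mktemp -d)
mkdir "$src/categories" "$src/orders" "$src/order_items" "$src/products"

# The shell replaces the pattern with the paths that match it.
set -- "$src"/order*
order_matches=$#
echo "order* matched $order_matches paths:"
for path in "$@"; do
  basename "$path"
done

# A bare * matches every entry except hidden ones.
set -- "$src"/*
all_matches=$#
echo "* matched $all_matches paths"

rm -rf "$src"
```

Prefixing a real command with `echo`, as in `echo hdfs dfs -put /data/retail_db/order* ...`, is a quick way to preview the expanded argument list before copying anything.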

In [11]:
%%sh

hdfs dfs -help rm

-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ... :
  Delete all files that match the specified file pattern. Equivalent to the Unix
  command "rm <src>"
                                                                                 
  -f          If the file does not exist, do not display a diagnostic message or 
              modify the exit status to reflect an error.                        
  -[rR]       Recursively deletes directories.                                   
  -skipTrash  option bypasses trash, if enabled, and immediately deletes <src>.  
  -safely     option requires safety confirmation, if enabled, requires          
              confirmation before deleting large directory with more than        
              <hadoop.shell.delete.limit.num.files> files. Delay is expected when
              walking over large directory recursively to count the number of    
              files to be deleted before the confirmation.                       


In [12]:
%%sh

hdfs dfs -rm -R -skipTrash /user/`whoami`/retail_db/retail_db

Deleted /user/itversity/retail_db/retail_db


In [13]:
%%sh

hdfs dfs -ls /user/`whoami`/retail_db/

In [14]:
%%sh

hdfs dfs -put /data/retail_db/order* /user/`whoami`/retail_db

In [15]:
%%sh

hdfs dfs -ls /user/`whoami`/retail_db/

Found 2 items
drwxr-xr-x   - itversity supergroup          0 2022-11-07 01:52 /user/itversity/retail_db/order_items
drwxr-xr-x   - itversity supergroup          0 2022-11-07 01:52 /user/itversity/retail_db/orders


In [16]:
%%sh

hdfs dfs -put -f /data/retail_db/* /user/`whoami`/retail_db

In [17]:
%%sh

hdfs dfs -ls /user/`whoami`/retail_db/

Found 9 items
drwxr-xr-x   - itversity supergroup          0 2022-11-07 01:53 /user/itversity/retail_db/categories
-rw-r--r--   1 itversity supergroup   10303297 2022-11-07 01:53 /user/itversity/retail_db/create_db.sql
-rw-r--r--   1 itversity supergroup       1748 2022-11-07 01:53 /user/itversity/retail_db/create_db_tables_pg.sql
drwxr-xr-x   - itversity supergroup          0 2022-11-07 01:53 /user/itversity/retail_db/customers
drwxr-xr-x   - itversity supergroup          0 2022-11-07 01:53 /user/itversity/retail_db/departments
-rw-r--r--   1 itversity supergroup   10297372 2022-11-07 01:53 /user/itversity/retail_db/load_db_tables_pg.sql
drwxr-xr-x   - itversity supergroup          0 2022-11-07 01:53 /user/itversity/retail_db/order_items
drwxr-xr-x   - itversity supergroup          0 2022-11-07 01:53 /user/itversity/retail_db/orders
drwxr-xr-x   - itversity supergroup          0 2022-11-07 01:53 /user/itversity/retail_db/products


In [18]:
%%sh

hdfs dfs -ls -R /user/`whoami`/retail_db/

drwxr-xr-x   - itversity supergroup          0 2022-11-07 01:53 /user/itversity/retail_db/categories
-rw-r--r--   1 itversity supergroup       1029 2022-11-07 01:53 /user/itversity/retail_db/categories/part-00000
-rw-r--r--   1 itversity supergroup   10303297 2022-11-07 01:53 /user/itversity/retail_db/create_db.sql
-rw-r--r--   1 itversity supergroup       1748 2022-11-07 01:53 /user/itversity/retail_db/create_db_tables_pg.sql
drwxr-xr-x   - itversity supergroup          0 2022-11-07 01:53 /user/itversity/retail_db/customers
-rw-r--r--   1 itversity supergroup     953719 2022-11-07 01:53 /user/itversity/retail_db/customers/part-00000
drwxr-xr-x   - itversity supergroup          0 2022-11-07 01:53 /user/itversity/retail_db/departments
-rw-r--r--   1 itversity supergroup         60 2022-11-07 01:53 /user/itversity/retail_db/departments/part-00000
-rw-r--r--   1 itversity supergroup   10297372 2022-11-07 01:53 /user/itversity/retail_db/load_db_tables_pg.sql
drwxr-xr-x   - itversity superg

```{note}
Alternatively, you can use `copyFromLocal`.
```

In [19]:
%%sh

hdfs dfs -rm -R -skipTrash /user/`whoami`/retail_db

Deleted /user/itversity/retail_db


In [20]:
%%sh

hdfs dfs -mkdir /user/`whoami`/retail_db

In [21]:
%%sh

hdfs dfs -ls /user/`whoami`/retail_db/

In [22]:
%%sh

hdfs dfs -copyFromLocal /data/retail_db/* /user/`whoami`/retail_db

In [23]:
%%sh

hdfs dfs -ls /user/`whoami`/retail_db

Found 9 items
drwxr-xr-x   - itversity supergroup          0 2022-05-29 17:23 /user/itversity/retail_db/categories
-rw-r--r--   1 itversity supergroup   10303297 2022-05-29 17:23 /user/itversity/retail_db/create_db.sql
-rw-r--r--   1 itversity supergroup       1748 2022-05-29 17:23 /user/itversity/retail_db/create_db_tables_pg.sql
drwxr-xr-x   - itversity supergroup          0 2022-05-29 17:23 /user/itversity/retail_db/customers
drwxr-xr-x   - itversity supergroup          0 2022-05-29 17:23 /user/itversity/retail_db/departments
-rw-r--r--   1 itversity supergroup   10297372 2022-05-29 17:23 /user/itversity/retail_db/load_db_tables_pg.sql
drwxr-xr-x   - itversity supergroup          0 2022-05-29 17:23 /user/itversity/retail_db/order_items
drwxr-xr-x   - itversity supergroup          0 2022-05-29 17:23 /user/itversity/retail_db/orders
drwxr-xr-x   - itversity supergroup          0 2022-05-29 17:23 /user/itversity/retail_db/products


```{note}
We can also copy the folder `/data/retail_db` directly to `/user/${USER}/retail_db`. Let us first delete `/user/${USER}/retail_db`, using `-skipTrash` to bypass the trash.
```

In [24]:
%%sh

hdfs dfs -rm -R -skipTrash /user/`whoami`/retail_db

Deleted /user/itversity/retail_db


```{note}
We can specify the target location as `/user/${USER}`. Because `/user/${USER}` already exists, the command creates the `retail_db` folder inside it along with its contents.
```

In [25]:
%%sh

hdfs dfs -put /data/retail_db /user/`whoami`

In [26]:
%%sh

hdfs dfs -ls /user/`whoami`/retail_db

Found 9 items
drwxr-xr-x   - itversity supergroup          0 2022-05-29 17:23 /user/itversity/retail_db/categories
-rw-r--r--   1 itversity supergroup   10303297 2022-05-29 17:23 /user/itversity/retail_db/create_db.sql
-rw-r--r--   1 itversity supergroup       1748 2022-05-29 17:23 /user/itversity/retail_db/create_db_tables_pg.sql
drwxr-xr-x   - itversity supergroup          0 2022-05-29 17:23 /user/itversity/retail_db/customers
drwxr-xr-x   - itversity supergroup          0 2022-05-29 17:23 /user/itversity/retail_db/departments
-rw-r--r--   1 itversity supergroup   10297372 2022-05-29 17:23 /user/itversity/retail_db/load_db_tables_pg.sql
drwxr-xr-x   - itversity supergroup          0 2022-05-29 17:23 /user/itversity/retail_db/order_items
drwxr-xr-x   - itversity supergroup          0 2022-05-29 17:23 /user/itversity/retail_db/orders
drwxr-xr-x   - itversity supergroup          0 2022-05-29 17:23 /user/itversity/retail_db/products


* If we try to run `hdfs dfs -put /data/retail_db /user/${USER}` again, it will fail because the files already exist in the target folder.

In [27]:
%%sh

hdfs dfs -put /data/retail_db /user/`whoami`

put: `/user/itversity/retail_db/categories/part-00000': File exists
put: `/user/itversity/retail_db/create_db.sql': File exists
put: `/user/itversity/retail_db/create_db_tables_pg.sql': File exists
put: `/user/itversity/retail_db/customers/part-00000': File exists
put: `/user/itversity/retail_db/departments/part-00000': File exists
put: `/user/itversity/retail_db/load_db_tables_pg.sql': File exists
put: `/user/itversity/retail_db/order_items/part-00000': File exists
put: `/user/itversity/retail_db/orders/part-00000': File exists
put: `/user/itversity/retail_db/products/part-00000': File exists


CalledProcessError: Command 'b'\nhdfs dfs -put /data/retail_db /user/`whoami`\n'' returned non-zero exit status 1.
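
In a script, a non-zero exit status like this one would usually abort the run. One way to make the copy safe to repeat is to check the target with `hdfs dfs -test` first. The helper below is a sketch under stated assumptions, not part of the original notebook: `put_if_absent` is a hypothetical function name, and the example call is skipped when the `hdfs` CLI is not on the `PATH`.

```shell
# put_if_absent is a hypothetical helper name introduced for this sketch.
# Usage: put_if_absent <local source> <HDFS parent directory>
put_if_absent() {
  src=$1
  parent=$2
  target="$parent/$(basename "$src")"

  # hdfs dfs -test -e exits with 0 when the path exists in HDFS.
  if hdfs dfs -test -e "$target"; then
    echo "skip: $target already exists"
    return 0
  fi
  hdfs dfs -put "$src" "$parent"
}

if command -v hdfs >/dev/null 2>&1; then
  put_if_absent /data/retail_db "/user/$(whoami)"
else
  echo "hdfs CLI not found; nothing copied"
fi
```

The check only looks at the top-level folder, so a partially copied `retail_db` would also be skipped.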

* We can use `-f` with `put` or `copyFromLocal` to overwrite the files that already exist in the target folder.

In [22]:
%%sh

hdfs dfs -put -f /data/retail_db /user/`whoami`

In [23]:
%%sh

hdfs dfs -ls /user/`whoami`/retail_db

Found 9 items
drwxr-xr-x   - itversity supergroup          0 2022-11-07 02:23 /user/itversity/retail_db/categories
-rw-r--r--   1 itversity supergroup   10303297 2022-11-07 02:23 /user/itversity/retail_db/create_db.sql
-rw-r--r--   1 itversity supergroup       1748 2022-11-07 02:23 /user/itversity/retail_db/create_db_tables_pg.sql
drwxr-xr-x   - itversity supergroup          0 2022-11-07 02:23 /user/itversity/retail_db/customers
drwxr-xr-x   - itversity supergroup          0 2022-11-07 02:23 /user/itversity/retail_db/departments
-rw-r--r--   1 itversity supergroup   10297372 2022-11-07 02:23 /user/itversity/retail_db/load_db_tables_pg.sql
drwxr-xr-x   - itversity supergroup          0 2022-11-07 02:23 /user/itversity/retail_db/order_items
drwxr-xr-x   - itversity supergroup          0 2022-11-07 02:23 /user/itversity/retail_db/orders
drwxr-xr-x   - itversity supergroup          0 2022-11-07 02:23 /user/itversity/retail_db/products


In [30]:
%%sh

hdfs dfs -ls -R /user/`whoami`/retail_db

drwxr-xr-x   - itversity supergroup          0 2022-05-29 17:24 /user/itversity/retail_db/categories
-rw-r--r--   1 itversity supergroup       1029 2022-05-29 17:24 /user/itversity/retail_db/categories/part-00000
-rw-r--r--   1 itversity supergroup   10303297 2022-05-29 17:24 /user/itversity/retail_db/create_db.sql
-rw-r--r--   1 itversity supergroup       1748 2022-05-29 17:24 /user/itversity/retail_db/create_db_tables_pg.sql
drwxr-xr-x   - itversity supergroup          0 2022-05-29 17:24 /user/itversity/retail_db/customers
-rw-r--r--   1 itversity supergroup     953719 2022-05-29 17:24 /user/itversity/retail_db/customers/part-00000
drwxr-xr-x   - itversity supergroup          0 2022-05-29 17:24 /user/itversity/retail_db/departments
-rw-r--r--   1 itversity supergroup         60 2022-05-29 17:24 /user/itversity/retail_db/departments/part-00000
-rw-r--r--   1 itversity supergroup   10297372 2022-05-29 17:24 /user/itversity/retail_db/load_db_tables_pg.sql
drwxr-xr-x   - itversity superg