## Basic sqoop import Command

In [0]:
sqoop import \
--connect jdbc:mysql://cxln2.c.thelab-240901.internal/sqoopex \
--username sqoopuser \
--password NHkkP876rp \
--table user \
--target-dir sqoop_import_dir

1. The target directory must not exist beforehand. If it already exists, the command will fail with an error.
2. Sqoop creates the directory during the import and writes the table data into it as part files.
3. No subfolder named after the table is created inside target-dir; the data files are placed directly in target-dir.

If target-dir is not provided, Sqoop creates a directory with the same name as the table in the current working directory, which on HDFS is normally the user's home directory.

In [0]:
sqoop import \
--connect jdbc:mysql://cxln2.c.thelab-240901.internal/sqoopex \
--username sqoopuser \
--password NHkkP876rp \
--table user

## Controlling Parallelism

1. To leverage parallelism, specify the number of mappers in the import command. Sqoop runs as many parallel import tasks as the value given to the --num-mappers parameter.
2. Each mapper writes its own part file, so the number of part files normally matches the number of mappers given in the sqoop import command.

In [0]:
sqoop import \
--connect jdbc:mysql://cxln2.c.thelab-240901.internal/sqoopex \
--username sqoopuser \
--password NHkkP876rp \
--table user \
--target-dir sqoop_import_dir \
--num-mappers 3


3. When performing parallel imports, Sqoop needs a criterion by which it can split the workload, and it uses a splitting column for this. By default, Sqoop identifies the primary key column (if present) in a table and uses it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form `SELECT * FROM sometable WHERE id >= lo AND id < hi`, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks. **(In the import above, user_id is the primary key of the table 'user' and is used as the splitting column, as the Sqoop log printed on the console shows.)**

4. If the actual values of the primary key are not uniformly distributed across its range, the tasks can end up unbalanced. In that case, explicitly choose a different column with the --split-by argument, for example --split-by employee_id. The same applies when importing from a table that has no primary key: a split-by column has to be specified. **(Below we import from a MySQL table 'logs' with the columns logid, logdate, jobname, stepname, status, error_message, none of which is a primary key. We therefore have to either name a column with the --split-by parameter or perform a sequential import using '-m 1'. Otherwise Sqoop fails with the following error.)** *ERROR tool.ImportTool: Error during import: No primary key could be found for table logs. Please specify one with --split-by or perform a sequential import with '-m 1'.*

5. The split-by column should normally be a numeric column, such as an integer id. If a textual column is passed, the import fails with: *ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Generating splits for a textual index column allowed only in case of "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" property passed as a parameter*

In [0]:
sqoop import \
--connect jdbc:mysql://cxln2.c.thelab-240901.internal/sqoopex \
--username sqoopuser \
--password NHkkP876rp \
--table logs \
--target-dir sqoop_import_dir \
--num-mappers 3 \
--split-by logid
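
The range arithmetic described in point 3 can be sketched in plain shell. This is only an illustration of the documented example (minimum 0, maximum 1000, 4 tasks); `split_ranges` is a hypothetical helper, not part of Sqoop, and Sqoop's real splitter may compute its boundaries differently.

```shell
# Hypothetical helper (not part of Sqoop): print the WHERE range each mapper
# would receive for a given minimum, maximum and number of mappers.
split_ranges() {
  min=$1; max=$2; mappers=$3
  span=$(( max - min ))
  i=0
  while [ "$i" -lt "$mappers" ]; do
    lo=$(( min + span * i / mappers ))
    if [ "$i" -eq $(( mappers - 1 )) ]; then
      hi=$(( max + 1 ))          # last range must still include the maximum
    else
      hi=$(( min + span * (i + 1) / mappers ))
    fi
    echo "WHERE id >= $lo AND id < $hi"
    i=$(( i + 1 ))
  done
}

split_ranges 0 1000 4
# WHERE id >= 0 AND id < 250
# WHERE id >= 250 AND id < 500
# WHERE id >= 500 AND id < 750
# WHERE id >= 750 AND id < 1001
```

If most ids were clustered in one of these ranges, that mapper would do most of the work, which is the imbalance point 4 warns about.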


### Warehouse-dir parameter

1. **warehouse-dir may already exist; if it does not, it is created during the import.**
2. During the import, a subfolder with the same name as the table is created inside warehouse-dir.
3. The data is written into this subfolder as part files.

In [0]:
sqoop import \
--connect jdbc:mysql://cxln2.c.thelab-240901.internal/sqoopex \
--username sqoopuser \
--password NHkkP876rp \
--table user \
--warehouse-dir sqoop_import_dir
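
To make the difference between --target-dir and --warehouse-dir concrete, the sketch below prints the file layout one would expect in each case. `expected_files` is a hypothetical helper, and the `part-m-NNNNN` names are an assumption based on how map-only MapReduce jobs typically name their output; check the actual listing on your cluster.

```shell
# Hypothetical helper: list the part files one would expect after an import.
# mode is "target" (--target-dir) or "warehouse" (--warehouse-dir).
expected_files() {
  mode=$1; dir=$2; table=$3; mappers=$4
  if [ "$mode" = "warehouse" ]; then
    dir="$dir/$table"            # warehouse-dir adds a subfolder per table
  fi
  i=0
  while [ "$i" -lt "$mappers" ]; do
    printf '%s/part-m-%05d\n' "$dir" "$i"
    i=$(( i + 1 ))
  done
}

expected_files target sqoop_import_dir user 3
expected_files warehouse sqoop_import_dir user 3
```

With --warehouse-dir, several tables can be imported under the same parent directory without their files colliding.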


### Using Options Files to Pass Arguments

1. When using Sqoop, the command line options that do not change from invocation to invocation can be put in an options file for convenience. An options file is a text file where each line identifies an option in the order that it appears otherwise on the command line.
2. To specify an options file, simply create an options file in a convenient location and pass it to the command line via --options-file argument.

Contents of import.txt (one option or value per line, without the trailing backslashes shown here): \
import \
--connect \
jdbc:mysql://cxln2.c.thelab-240901.internal/sqoopex \
--username \
sqoopuser \
--password \
NHkkP876rp





In [0]:
sqoop --options-file /home/dataenggdatascfreelance1247/import.txt --table user
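
One possible way to create such an options file from the shell is a here-document. The path in `opts_file` is an assumption for illustration, and the comment line relies on Sqoop accepting lines that start with `#` as comments in options files.

```shell
# Assumed location for the options file; adjust to your own home directory.
opts_file="${HOME:-/tmp}/import.txt"

# Each option and each value goes on its own line.
cat > "$opts_file" <<'EOF'
# Options shared by every import from the sqoopex database
import
--connect
jdbc:mysql://cxln2.c.thelab-240901.internal/sqoopex
--username
sqoopuser
--password
NHkkP876rp
EOF

cat "$opts_file"
```

Options that change between runs, such as --table, stay on the command line.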


### Secure way of supplying password to the database

The preferred way to supply credentials is to save the password in a file in the user's home directory with 400 permissions and pass the path to that file with the --password-file argument. Sqoop then reads the password from the file and passes it to the MapReduce cluster by secure means, without exposing the password in the job configuration. The file containing the password can be either on the local filesystem or on HDFS.

Contents of password.txt: \
NHkkP876rp



In [0]:
sqoop import \
--connect jdbc:mysql://cxln2.c.thelab-240901.internal/sqoopex \
--username sqoopuser \
--password-file /home/dataenggdatascfreelance1247/password.txt \
--table user
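
A minimal sketch of creating the password file with 400 permissions follows. `write_password_file` is a hypothetical helper. It uses `printf '%s'` rather than `echo` so that no trailing newline is written; to my understanding, Sqoop's documentation warns that the whole file content, including trailing whitespace, is used as the password.

```shell
# Hypothetical helper: write a password to a file readable only by its owner.
write_password_file() {
  path=$1; password=$2
  rm -f "$path"                              # a leftover read-only file cannot be overwritten
  ( umask 077; printf '%s' "$password" > "$path" )   # no trailing newline
  chmod 400 "$path"                          # owner read-only
}

# Assumed location; the text above suggests the user's home directory.
pw_file="${HOME:-/tmp}/password.txt"
write_password_file "$pw_file" 'NHkkP876rp'
ls -l "$pw_file"
```

In real use you would avoid typing the password literally on the command line, since it can end up in the shell history.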


### Selecting the Data to Import

By default, all columns within a table are selected for import. You can select a subset of columns and control their ordering by using the --columns argument.


In [0]:
sqoop import \
--connect jdbc:mysql://cxln2.c.thelab-240901.internal/sqoopex \
--username sqoopuser \
--password NHkkP876rp \
--table user \
--columns name,age,country

You can append a WHERE clause to this with the --where argument. For example: --where "id > 400"

In [0]:
sqoop import \
--connect jdbc:mysql://cxln2.c.thelab-240901.internal/sqoopex \
--username sqoopuser \
--password NHkkP876rp \
--table user \
--columns name,age,country \
--where "user_id > 2"

### Free-form Query Imports

1. Sqoop can also import the result set of an arbitrary SQL query. Instead of using the --table, --columns and --where arguments, you can specify a SQL statement with the --query argument.
2. When importing a free-form query, you must specify a destination directory with --target-dir.
3. When importing query results in parallel, you must specify --split-by.
4. The query must include the \$CONDITIONS token in its WHERE clause, whether or not it has any other conditions; each Sqoop process replaces the token with its own split condition.

In [0]:
sqoop import \
--connect jdbc:mysql://cxln2.c.thelab-240901.internal/sqoopex \
--username sqoopuser \
--password NHkkP876rp \
--query "SELECT * FROM user where \$CONDITIONS" \
--target-dir sqoop_import_dir \
--split-by user_id
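
The role of the \$CONDITIONS token can be illustrated with plain shell string handling. The sketch below first shows that the escaped double-quoted form and the single-quoted form pass the same literal text to Sqoop, then substitutes a made-up split condition the way each mapper's query might end up looking. `fill_conditions` and the example range are hypothetical; Sqoop generates the real conditions itself.

```shell
# Both quoting styles hand Sqoop the literal token $CONDITIONS.
q_double="SELECT * FROM user WHERE \$CONDITIONS"
q_single='SELECT * FROM user WHERE $CONDITIONS'
[ "$q_double" = "$q_single" ] && echo "same query text"

# Hypothetical helper: replace the token with one mapper's split condition.
fill_conditions() {
  printf '%s\n' "$1" | sed "s/\\\$CONDITIONS/$2/"
}

mapper_query=$(fill_conditions "$q_single" 'user_id >= 1 AND user_id < 4')
echo "$mapper_query"
# SELECT * FROM user WHERE user_id >= 1 AND user_id < 4
```

Without the backslash, a double-quoted `$CONDITIONS` would be expanded by the shell (usually to an empty string) before Sqoop ever sees it.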

## Incremental Imports

1. The following arguments control incremental imports:\
--check-column (col) : Specifies the column to be examined when determining which rows to import. (The column should not be of type CHAR/NCHAR/VARCHAR/VARNCHAR/LONGVARCHAR/LONGNVARCHAR.)\
--incremental (mode) : Specifies how Sqoop determines which rows are new. Legal values for mode include 'append' and 'lastmodified'.\
--last-value (value) : Specifies the maximum value of the check column from the previous import.

2. You should specify 'append' mode when importing a table where new rows are continually being added with increasing row id values. You specify the column containing the row id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value (that is, the largest check-column value already imported from the table).

3. You should use 'lastmodified' when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.

4. If --last-value is not supplied, the import brings in all the data of the table.

5. At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen.

In [0]:
sqoop import \
--connect jdbc:mysql://cxln2.c.thelab-240901.internal/sqoopex \
--username sqoopuser \
--password NHkkP876rp \
--table user \
--check-column user_id \
--incremental append \
--last-value 5
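
Since each run prints the value to use as the next --last-value, one simple way to carry it between runs is a small state file. The sketch below only builds and prints the command; the state file name and the `incremental_args` helper are assumptions, and the default of 0 assumes user_id values start above 0.

```shell
# Hypothetical state file holding the last imported user_id.
state_file="./user_last_value.txt"

# Hypothetical helper: print the incremental arguments for the next run.
incremental_args() {
  if [ -s "$state_file" ]; then
    last_value=$(cat "$state_file")
  else
    last_value=0      # assumption: ids start above 0, so this imports every row
  fi
  echo "--check-column user_id --incremental append --last-value $last_value"
}

# Print the command that would be run; remove the echo to execute it.
echo sqoop import \
  --connect jdbc:mysql://cxln2.c.thelab-240901.internal/sqoopex \
  --username sqoopuser \
  --password-file /home/dataenggdatascfreelance1247/password.txt \
  --table user \
  $(incremental_args)
```

After a successful run, you would write the value Sqoop reports into the state file so the next run starts from there.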