# Data Ingestion with a Spark Job

## 1. Create the destination table in the data mart


**1. Create a table to store the streaming data**

In [3]:
USE [DataPool];

-- Reset the object: drop the external table if it already exists
IF EXISTS (SELECT * FROM sys.external_tables WHERE name = 'web_clickstreams_spark_results')
BEGIN
	DROP EXTERNAL TABLE web_clickstreams_spark_results
END;
GO

In [4]:
USE [DataPool];

CREATE EXTERNAL TABLE [web_clickstreams_spark_results]
(
	wcs_click_date_sk BIGINT , 
	wcs_click_time_sk BIGINT , 
	wcs_sales_sk BIGINT , 
	wcs_item_sk BIGINT , 
	wcs_web_page_sk BIGINT , 
	wcs_user_sk BIGINT)
WITH
(
    DATA_SOURCE = SqlDataPool,
    DISTRIBUTION = ROUND_ROBIN
);


**2. Confirm that the table was created on the data pool SQL Server instances**

In [5]:
USE [DataPool];

SELECT
	(SELECT name from [DATA-0-0.DATA-0-SVC.MSSQL-CLUSTER.SVC.CLUSTER.LOCAL].DataPool.sys.servers WHERE server_id = 0) AS server_name
	, o.name
	, i.name
	, i.type_desc
	, (SELECT COUNT(*) FROM [DATA-0-0.DATA-0-SVC.MSSQL-CLUSTER.SVC.CLUSTER.LOCAL].DataPool.dbo.web_clickstreams_spark_results) AS count
FROM
	[DATA-0-0.DATA-0-SVC.MSSQL-CLUSTER.SVC.CLUSTER.LOCAL].DataPool.sys.indexes AS i
	LEFT JOIN 
	[DATA-0-0.DATA-0-SVC.MSSQL-CLUSTER.SVC.CLUSTER.LOCAL].DataPool.sys.objects AS o
	ON i.object_id = o.object_id
WHERE
	o.name = 'web_clickstreams_spark_results'
UNION
SELECT
	(SELECT name from [DATA-0-1.DATA-0-SVC.MSSQL-CLUSTER.SVC.CLUSTER.LOCAL].DataPool.sys.servers WHERE server_id = 0) AS server_name
	, o.name
	, i.name
	, i.type_desc
	, (SELECT COUNT(*) FROM [DATA-0-1.DATA-0-SVC.MSSQL-CLUSTER.SVC.CLUSTER.LOCAL].DataPool.dbo.web_clickstreams_spark_results) AS count
FROM
	[DATA-0-1.DATA-0-SVC.MSSQL-CLUSTER.SVC.CLUSTER.LOCAL].DataPool.sys.indexes AS i
	LEFT JOIN 
	[DATA-0-1.DATA-0-SVC.MSSQL-CLUSTER.SVC.CLUSTER.LOCAL].DataPool.sys.objects AS o
	ON i.object_id = o.object_id
WHERE
	o.name = 'web_clickstreams_spark_results'
GO

server_name,name,name.1,type_desc,count
data-0-0,web_clickstreams_spark_results,cci,CLUSTERED COLUMNSTORE,0
data-0-1,web_clickstreams_spark_results,cci,CLUSTERED COLUMNSTORE,0
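
The verification query above repeats the same SELECT once per data pool instance (`data-0-0` and `data-0-1`). As a minimal sketch, assuming instance names follow the `DATA-0-<n>.DATA-0-SVC.MSSQL-CLUSTER.SVC.CLUSTER.LOCAL` pattern seen in that query, the statement text could be generated for any number of instances. The helper name `build_check_query` and its parameters are hypothetical and exist only for this illustration.

```python
def build_check_query(table, node_count, database="DataPool"):
    """Build the per-instance verification query as a single UNION statement.

    Assumption: instances are named DATA-0-<n>.DATA-0-SVC.MSSQL-CLUSTER.SVC.CLUSTER.LOCAL,
    as in the query above. The table name is inserted as-is (no escaping),
    so pass only trusted identifiers.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    selects = []
    for n in range(node_count):
        # Four-part name prefix for one data pool instance (assumed naming pattern).
        prefix = f"[DATA-0-{n}.DATA-0-SVC.MSSQL-CLUSTER.SVC.CLUSTER.LOCAL].{database}"
        selects.append(
            "SELECT\n"
            f"\t(SELECT name FROM {prefix}.sys.servers WHERE server_id = 0) AS server_name\n"
            "\t, o.name\n"
            "\t, i.name\n"
            "\t, i.type_desc\n"
            f"\t, (SELECT COUNT(*) FROM {prefix}.dbo.{table}) AS count\n"
            "FROM\n"
            f"\t{prefix}.sys.indexes AS i\n"
            f"\tLEFT JOIN {prefix}.sys.objects AS o\n"
            "\tON i.object_id = o.object_id\n"
            "WHERE\n"
            f"\to.name = '{table}'"
        )
    # One SELECT per instance, combined the same way as the hand-written query.
    return "\nUNION\n".join(selects)


if __name__ == "__main__":
    print(build_check_query("web_clickstreams_spark_results", 2))
```

The generated text can be pasted into a query cell after `USE [DataPool];`. Generating it keeps the per-instance blocks identical, which is easy to get wrong when copying by hand.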


## 2. Submit the Spark job
In Azure Data Studio, perform the following steps to ingest data with a Spark job.


1. In HDFS, right-click jar/mssql-spark-lib/assembly-1.0.jar and select Submit spark job
1. Run the job with the following settings
```
    Main class: FileStreaming
    Arguments : 
	--server master-0.master-svc --port 1433 --user sa --password P@ssw0rd --database DataPool --table web_clickstreams_spark_results --source_dir hdfs:///clickstream_data --input_format csv --enable_checkpoint false
```
3. Check the job details from the History Url shown in Output and from the Yarn UI  
![yarn](https://github.com/MasayukiOzawa/decode-2019-demo/raw/master/Images/04.Integrated%20Data%20Access/01.Spark%20Job/Yarn.png)  
![Spark Job](https://github.com/MasayukiOzawa/decode-2019-demo/raw/master/Images/04.Integrated%20Data%20Access/01.Spark%20Job/Spark%20Job.png)
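
The Arguments value above is one long line of `--name value` pairs, which is easy to mistype. The following is a minimal sketch that assembles the same line from a mapping; `build_job_arguments` is a hypothetical helper, not part of the job or of Azure Data Studio. Because the quoting rules of the submit dialog are not stated in this document, the sketch rejects values containing whitespace instead of guessing how to quote them.

```python
def build_job_arguments(options):
    """Join an ordered mapping of option names and values into '--name value' pairs.

    Assumption: the dialog splits arguments on whitespace, so names and values
    containing whitespace are rejected rather than quoted.
    """
    parts = []
    for name, value in options.items():
        text = str(value)
        if not text or any(ch.isspace() for ch in name + text):
            raise ValueError(f"option {name!r} has an empty or whitespace-containing value")
        parts.extend([f"--{name}", text])
    return " ".join(parts)


if __name__ == "__main__":
    # Option names and values are taken from the settings shown above.
    print(build_job_arguments({
        "server": "master-0.master-svc",
        "port": 1433,
        "user": "sa",
        "password": "P@ssw0rd",  # demo credential from this walkthrough
        "database": "DataPool",
        "table": "web_clickstreams_spark_results",
        "source_dir": "hdfs:///clickstream_data",
        "input_format": "csv",
        "enable_checkpoint": "false",
    }))
```

Python dictionaries preserve insertion order, so the options appear in the order they are written.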

## 3. Stop the job
```
kubectl exec -n mssql-cluster -it master-0 -c hadoop -- /bin/bash
yarn application -list
yarn application -kill application_
```
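
The kill command needs the full application ID, which has to be read from the output of `yarn application -list`. As a minimal sketch, the IDs can be pulled out of captured output by matching the YARN ID format `application_<cluster timestamp>_<sequence>`. The helper `find_application_ids` and the sample listing are hypothetical; the sample is made up for illustration, and the real column layout may differ, which is why the sketch relies only on the ID format.

```python
import re

# YARN application IDs look like application_<cluster timestamp>_<sequence number>.
APP_ID_PATTERN = re.compile(r"\bapplication_\d+_\d+\b")


def find_application_ids(listing, name_filter=None):
    """Return the first application ID found on each line of the listing.

    If name_filter is given, only lines containing that text are considered.
    """
    ids = []
    for line in listing.splitlines():
        match = APP_ID_PATTERN.search(line)
        if match and (name_filter is None or name_filter in line):
            ids.append(match.group(0))
    return ids


if __name__ == "__main__":
    # Made-up sample listing, NOT real cluster output.
    sample = (
        "Application-Id\tApplication-Name\tState\n"
        "application_1550000000000_0001\tFileStreaming\tRUNNING\n"
        "application_1550000000000_0002\tOtherJob\tRUNNING\n"
    )
    for app_id in find_application_ids(sample, name_filter="FileStreaming"):
        print(f"yarn application -kill {app_id}")
```

Filtering by name before building the kill command reduces the chance of stopping an unrelated job that is running on the same cluster.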


## 4. Verify the data

**1. Verify the data with a query**

In [8]:
USE [DataPool];

SELECT COUNT(*) FROM [web_clickstreams_spark_results]

(No column name)
998


**2. Check the ingestion status of the data pool**

In [9]:
USE [DataPool];

SELECT
	(SELECT name from [DATA-0-0.DATA-0-SVC.MSSQL-CLUSTER.SVC.CLUSTER.LOCAL].DataPool.sys.servers WHERE server_id = 0) AS server_name
	, o.name
	, i.name
	, i.type_desc
	, (SELECT COUNT(*) FROM [DATA-0-0.DATA-0-SVC.MSSQL-CLUSTER.SVC.CLUSTER.LOCAL].DataPool.dbo.web_clickstreams_spark_results) AS count
FROM
	[DATA-0-0.DATA-0-SVC.MSSQL-CLUSTER.SVC.CLUSTER.LOCAL].DataPool.sys.indexes AS i
	LEFT JOIN 
	[DATA-0-0.DATA-0-SVC.MSSQL-CLUSTER.SVC.CLUSTER.LOCAL].DataPool.sys.objects AS o
	ON i.object_id = o.object_id
WHERE
	o.name = 'web_clickstreams_spark_results'
UNION
SELECT
	(SELECT name from [DATA-0-1.DATA-0-SVC.MSSQL-CLUSTER.SVC.CLUSTER.LOCAL].DataPool.sys.servers WHERE server_id = 0) AS server_name
	, o.name
	, i.name
	, i.type_desc
	, (SELECT COUNT(*) FROM [DATA-0-1.DATA-0-SVC.MSSQL-CLUSTER.SVC.CLUSTER.LOCAL].DataPool.dbo.web_clickstreams_spark_results) AS count
FROM
	[DATA-0-1.DATA-0-SVC.MSSQL-CLUSTER.SVC.CLUSTER.LOCAL].DataPool.sys.indexes AS i
	LEFT JOIN 
	[DATA-0-1.DATA-0-SVC.MSSQL-CLUSTER.SVC.CLUSTER.LOCAL].DataPool.sys.objects AS o
	ON i.object_id = o.object_id
WHERE
	o.name = 'web_clickstreams_spark_results'
GO

server_name,name,name.1,type_desc,count
data-0-0,web_clickstreams_spark_results,cci,CLUSTERED COLUMNSTORE,1497
data-0-1,web_clickstreams_spark_results,cci,CLUSTERED COLUMNSTORE,499
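
To see how the rows are spread across the data pool instances, the per-instance counts in the result above can be totaled and converted to shares. This is a minimal sketch that parses the CSV-style result exactly as shown; `summarize_counts` is a hypothetical helper name.

```python
import csv
import io


def summarize_counts(csv_text):
    """Return (total_rows, {server_name: share_of_total}) from the CSV result text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Look columns up by position because the header contains two 'name'-like columns.
    server_col = header.index("server_name")
    count_col = header.index("count")
    counts = {row[server_col]: int(row[count_col]) for row in reader if row}
    total = sum(counts.values())
    shares = {server: (count / total if total else 0.0) for server, count in counts.items()}
    return total, shares


if __name__ == "__main__":
    # Result rows copied from the output above.
    result = (
        "server_name,name,name.1,type_desc,count\n"
        "data-0-0,web_clickstreams_spark_results,cci,CLUSTERED COLUMNSTORE,1497\n"
        "data-0-1,web_clickstreams_spark_results,cci,CLUSTERED COLUMNSTORE,499\n"
    )
    total, shares = summarize_counts(result)
    print(total, shares)  # 1996 {'data-0-0': 0.75, 'data-0-1': 0.25}
```

For the counts shown, `data-0-0` holds 1497 of 1996 rows (75%) and `data-0-1` holds 499 (25%).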
