fixing some issues in my HDI docs
mumian committed Mar 29, 2013
1 parent 86b9bc1 commit ece1f6c
Showing 8 changed files with 73 additions and 50 deletions.
6 changes: 3 additions & 3 deletions ITPro/Services/hdinsight/blob-hive-sql.md
@@ -83,7 +83,7 @@ In this tutorial, you will use the Hive console to run the Hive queries. The ot

#ls asv://flightinfo@StorageAccountName.blob.core.windows.net/delays

You will get the list of files you uploaded using Azure Storage Explorer.

##<a id="createtable"></a>Create a Hive Table and Populate Data
The next step is to create a Hive table from the data in Azure Storage Vault (ASV)/Blob storage.
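
A definition along the following lines would point an external Hive table at the uploaded files; the column list and delimiter are illustrative assumptions, not the tutorial's exact schema:

CREATE EXTERNAL TABLE delays (origin_city_name STRING, weather_delay FLOAT) -- assumed columns, for illustration
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'asv://flightinfo@StorageAccountName.blob.core.windows.net/delays';
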
@@ -197,13 +197,13 @@ The next step is to create a Hive table from the data in Azure Storage Vault (AS
##<a id="executequery"></a>Execute a HiveQL Query
After the *delays* table has been created, you are now ready to run queries against it.

1. Replace **username** in the following query with the username you used to log into the cluster, and then copy and paste the following query into the query pane:

INSERT OVERWRITE DIRECTORY '/user/username/queryoutput' select regexp_replace(origin_city_name, '''', ''), avg(weather_delay) from delays where weather_delay is not null group by origin_city_name;

This query computes the average weather delay and groups the results by city name. It will also output the results to HDFS. Note that the query will remove apostrophes from the data and will exclude rows where the value for *weather_delay* is *null*, which is necessary because Sqoop, used in the next step, doesn't handle those values gracefully by default.

2. Click **Evaluate**. The output from the query above should look similar to the following:

Hive history file=c:\apps\dist\hive-0.9.0\logs/hive_job_log_RD00155D47138A$_201303220108_1260638792.txt
Logging initialized using configuration in file:/C:/apps/dist/hive-0.9.0/conf/hive-log4j.properties
63 changes: 43 additions & 20 deletions ITPro/Services/hdinsight/upload-data.md
@@ -4,7 +4,7 @@

#How to Upload Data to HDInsight

Windows Azure HDInsight Service provides two options for managing its data: Azure Storage Vault (ASV) and Hadoop Distributed File System (HDFS). HDFS is designed to store data used by Hadoop applications. Data stored in Windows Azure Blob Storage can be accessed by Hadoop applications through ASV, which provides a full-featured HDFS file system over Windows Azure Blob storage. ASV has been designed as an HDFS extension to provide a seamless experience to customers by enabling the full set of components in the Hadoop ecosystem to operate directly on the data it manages. Both options are distinct file systems that are optimized for storage of data and computations on that data. For the benefits of using ASV, see [Using Windows Azure Blob Storage with HDInsight](/en-us/manage/services/hdinsight/howto-blob-store/).

Windows Azure HDInsight clusters are typically deployed to execute MapReduce jobs and are dropped once these jobs have been completed. Keeping the data in the HDFS clusters after computations have been completed would be an expensive way to store this data. Windows Azure Blob storage is a highly available, highly scalable, high capacity, low cost, and shareable storage option for data that is to be processed using HDInsight. Storing data in a Blob enables the HDInsight clusters used for computation to be safely released without losing data.

@@ -19,16 +19,16 @@ Windows Azure Blob storage can either be accessed through the [API](http://www.w
##Table of Contents

* [How to: Upload data to Windows Azure Storage using Azure Storage Explorer](#storageexplorer)
* [How to: Access data stored in ASV](#blob)
* [How to: Upload data to ASV using Interactive JavaScript Console](#console)
* [How to: Upload data to ASV using Hadoop command line](#commandline)
* [How to: Import data from Windows Azure SQL Database to ASV using Sqoop](#sqoop)

##<a id="storageexplorer"></a>How to: Upload data to Windows Azure Storage using Azure Storage Explorer
##<a id="storageexplorer"></a>How to: Upload Data to Windows Azure Storage Using Azure Storage Explorer

*Azure Storage Explorer* is a useful tool for inspecting and altering the data in your Windows Azure Storage. It is a free tool that can be downloaded from [http://azurestorageexplorer.codeplex.com/](http://azurestorageexplorer.codeplex.com/ "Azure Storage Explorer").

Before using the tool, you must know your Windows Azure storage account name and account key. For instructions on how to get this information, see the *How to: View, copy and regenerate storage access keys* section of [How to Manage Storage Accounts](/en-us/manage/services/storage/how-to-manage-a-storage-account/).

1. Run Azure Storage Explorer.

@@ -47,28 +47,31 @@ Before using the tool, you must know your Windows Azure storage account name and
6. From **Blob**, click **Upload**.
7. Specify a file to upload, and then click **Open**.

Blob storage containers store data as key/value pairs, and there is no directory hierarchy. However, the ‘/’ character can be used within the key name to make it appear as if a file is stored within a directory structure. For example, a blob’s key may be ‘input/log1.txt’. No actual ‘input’ directory exists, but due to the presence of the ‘/’ character in the key name, it has the appearance of a file path. You can click **Rename** in the tool to give a file the appearance of a folder structure.
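
For example, after uploading a blob whose key is ‘input/log1.txt’, a directory-style listing can resolve the virtual path; the container and account names below are placeholders:

hadoop dfs -ls asv://mycontainer@myaccount.blob.core.windows.net/input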

##<a id="blob"></a>How to: Access data stored in Windows Azure Blob Storage
##<a id="blob"></a>How to: Access Data Stored in Azure Storage Vault

Data stored in Windows Azure Blob Storage can be accessed directly from the Interactive JavaScript Console by prefixing the protocol scheme of the URI for the assets you are accessing with asv://. To secure the connection, use asvs://. The scheme for accessing data in Windows Azure Blob Storage is:

asv[s]://[<container>@]<accountname>.blob.core.windows.net/<path>

The following is an example of viewing data stored in Windows Azure Blob Storage using the Interactive JavaScript Console:

![HDI.ASVSample](../media/HDI.ASVSample.png "ASV sample")

The following will run a Hadoop streaming job that uses Windows Azure Blob Storage for both input and output:

hadoop jar hadoop-streaming.jar
-files "hdfs:///example/apps/map.exe, hdfs:///example/apps/reduce.exe"
-input "asv://iislogsinput/iislogs.txt"
-output "asv://iislogsoutput/results.txt"
-input "asvs://container@storageaccount.blob.core.windows.net/iislogsinput/iislogs.txt"
-output "asvs://container@storageaccount.blob.core.windows.net/iislogsoutput/results.txt"
-mapper "map.exe"
-reducer "reduce.exe"

For more information on accessing the files stored in ASV, see [Using Windows Azure Blob Storage with HDInsight](/en-us/manage/services/hdinsight/howto-blob-store/).

##<a id="console"></a> How to: Upload data to HDFS using interactive JavaScript console

##<a id="console"></a> How to: Upload Data to ASV Using Interactive JavaScript Console
Windows Azure HDInsight Service comes with a web-based interactive JavaScript console that can be used as an administration/deployment tool.

1. Sign in to the [Management Portal](https://manage.windowsazure.com).
@@ -88,12 +91,21 @@ Windows Azure HDInsight Service comes with a web based interactive JavaScript co

![HDI.fs.put](../media/HDI.fsput.png "fs.put()")

9. Enter **Source** and **Destination**, and then click **Upload**. Here are some sample values for the **Destination** field:

<table border="1">
<tr><th>Sample</th><th>Note</th></tr>
<tr><td>.</td><td>refer to /user/&lt;currentloggedinuser&gt; on the default file system.</td></tr>
<tr><td>/</td><td>refer to / on the default file system.</td></tr>
<tr><td>asv:/// or asvs://container@accountname.blob.core.windows.net</td><td>refer to / on teh default file system.</td></tr>
</table>


10. Use the following command to list the uploaded files.

#ls <path>
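
To spot-check the contents of an uploaded file, the console's #tail command can also be used; the path below is a placeholder for the destination you entered above:

#tail /user/admin/data.txt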

##<a id="commandline"></a> How to: Upload data to HDFS using Hadoop command line
##<a id="commandline"></a> How to: Upload Data to ASV Using Hadoop Command Line

To use the Hadoop command line, you must first connect to the cluster using remote desktop.

@@ -114,12 +126,22 @@

hadoop dfs -copyFromLocal C:\temp\davinci.txt /example/data/davinci.txt

Because the default file system is on ASV, /example/data/davinci.txt is actually on ASV. You can also refer to the file as:

asv:///example/data/davinci.txt

or

asvs://container@accountname.blob.core.windows.net/example/data/davinci.txt

The FQDN is required when you use asvs.

13. Use the following command to list the uploaded files:

hadoop dfs -lsr /example/data


##<a id="sqoop"></a> How to: Import data to HDFS from SQL Database/SQL Server using Sqoop
##<a id="sqoop"></a> How to: Import Data to HDFS from SQL Database/SQL Server Using Sqoop

Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use it to import data from a relational database management system (RDBMS) such as SQL Server, MySQL, or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop with MapReduce or Hive, and then export the data back into an RDBMS. For more information, see [Sqoop User Guide](http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html).
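
For the export direction mentioned above, a run takes a shape similar to an import; the server, credentials, table, and directory below are placeholders rather than values from this tutorial:

sqoop export
--connect "jdbc:sqlserver://yourserver.database.windows.net;username=user1@yourserver;password=YourPassword;database=YourDatabase"
--table TargetTable
--export-dir /data/exportData
-m 1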

@@ -137,13 +159,13 @@ Before importing data, you must know the Windows Azure SQL Database server name,
12. Run a command similar to the following:

sqoop import
--connect "jdbc:sqlserver://s6ok0p9kfz.database.windows.net;username=user1@s6ok0p9kfz;password=Pass@word1;database=AdventureWorks2012"
--connect "jdbc:sqlserver://s6ok0p9kft.database.windows.net;username=user1@s6ok0p9kft;password=Pass@word1;database=AdventureWorks2012"
--table Sales.SalesOrderDetail
--columns "SalesOrderID,SalesOrderDetailID,CarrierTrackingNumber,OrderQty,ProductID,SpecialOfferID,UnitPrice,UnitPriceDiscount,LineTotal"
--target-dir /data/lineitemData
-m 1

In the command, the SQL Database server is *s6ok0p9kft*, the username is *user1*, the password is *Pass@word1*, and the database is *AdventureWorks2012*.

13. You can run the #tail command from the Interactive Console to see the result:
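
With a single mapper (-m 1), Sqoop conventionally writes its output to a file named part-m-00000, so given the --target-dir used above the command would look something like the following (the exact file name is an assumption):

#tail /data/lineitemData/part-m-00000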

@@ -160,8 +182,9 @@ Note: When specifying an escape character as delimiter with the arguments *--inp

## Next Steps
Now that you understand how to get data into HDInsight Service, use the following tutorials to learn how to perform analysis:

* [Getting Started with Windows Azure HDInsight Service](/en-us/manage/services/hdinsight/get-started-hdinsight/)
* [Tutorial: Using MapReduce with HDInsight](/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/)
* [Tutorial: Using Hive with HDInsight](/en-us/manage/services/hdinsight/using-hive-with-hdinsight/)
* [Tutorial: Using Pig with HDInsight](/en-us/manage/services/hdinsight/using-pig-with-hdinsight/)
2 changes: 1 addition & 1 deletion ITPro/Services/hdinsight/using-blob-store.md
@@ -35,7 +35,7 @@ The HDInsight Service provides access to the distributed file system that is loc

In addition, HDInsight Service provides the ability to access data stored in Blob Storage containers. The syntax to access ASV is:

asv[s]://[<container>@]<accountname>.blob.core.windows.net/<path>


Hadoop supports a notion of a default file system. The default file system implies a default scheme and authority; it can also be used to resolve relative paths. During the HDInsight provision process, the user must specify a Blob Storage account and a container to be used as the default file system.
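
For example, with a default container configured, the following paths would all resolve to the same location; the container and account names are placeholders:

#ls /example/data
#ls asv:///example/data
#ls asvs://container@accountname.blob.core.windows.net/example/data
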
6 changes: 3 additions & 3 deletions ITPro/Services/hdinsight/using-hdinsight-sdk.md
@@ -21,12 +21,12 @@ You can install the latest published build of the library from [NuGet](http://nuget.

* **MapReduce library:** This library simplifies writing MapReduce jobs in .NET languages using the Hadoop streaming interface.
* **LINQ to Hive client library:** This library translates C# or F# LINQ queries into HiveQL queries and executes them on the Hadoop cluster. This library can execute arbitrary HiveQL queries from a .NET program as well.
* **WebClient library:** This library contains client libraries for *WebHDFS* and *WebHCat*.

* **WebHDFS client library:** It works with files in HDFS and Windows Azure Blob Storage.
* **WebHCat client library:** It manages the scheduling and execution of jobs in an HDInsight cluster.

The NuGet syntax to install the libraries:

install-package Microsoft.Hadoop.MapReduce
install-package Microsoft.Hadoop.Hive
@@ -54,7 +54,7 @@ In this section you will learn how to upload files to a Hadoop cluster programmati

<table>
<tr><th>Property</th><th>Value</th></tr>
<tr><td>Category</td><td>Templates/Visual C#/Windows</td></tr>
<tr><td>Template</td><td>Console Application</td></tr>
<tr><td>Name</td><td>SimpleHiveJob</td></tr>
</table>
8 changes: 4 additions & 4 deletions ITPro/Services/hdinsight/using-hive.md
@@ -16,7 +16,7 @@ Hive provides a means of running MapReduce jobs through an SQL-like scripting lan

##In this Article

* [The Hive usage case](#usage)
* [Upload a sample log4j file to Windows Azure Blob Storage](#uploaddata)
* [Connect to the interactive console](#connect)
* [Create a Hive table and upload data to the table](#createhivetable)
@@ -51,7 +51,7 @@ In this tutorial, you will complete the following tasks:

##<a id="uploaddata"></a>Upload a Sample Log4j File to Windows Azure Blob Storage

HDInsight provides two options for storing data, Windows Azure Blob Storage and Hadoop Distributed File System (HDFS). For more information on choosing file storage, see [Using Windows Azure Blob Storage with HDInsight](/en-us/manage/services/hdinsight/howto-blob-store). When you provision an HDInsight cluster, the provision process creates a Windows Azure Blob storage container as the default HDInsight file system. To simplify the tutorial procedures, you will use this container for storing the log4j file.

*Azure Storage Explorer* is a useful tool for inspecting and altering the data in your Windows Azure Storage. It is a free tool that can be downloaded from [http://azurestorageexplorer.codeplex.com/](http://azurestorageexplorer.codeplex.com/ "Azure Storage Explorer").

Expand Down Expand Up @@ -123,7 +123,7 @@ Before using the tool, you must know your Windows Azure storage account name and
12. Click **Close**.
13. From the **File** menu, click **Exit** to close Azure Storage Explorer.

For more information on accessing ASV, see [Using Windows Azure Blob Storage with HDInsight](/en-us/manage/services/hdinsight/howto-blob-store/).

##<a id="connect"></a> Connect to the Interactive Console

@@ -162,7 +162,7 @@ You must have an HDInsight cluster provisioned before you can work on this tutor

To use asvs, you must provide the FQDN. For example, to access sample.log on the default file system:

#ls asvs://container@storagename.blob.core.windows.net/sample.log 


##<a id="createhivetable"></a> Create a Hive Table and Upload Data to the Table
