Merge pull request #1075 from mumian/jgaoHDI20130325

refresh the 9 articles assigned to me
commit 87fde68ff3f919bbb51023b6d1162288a3329298 2 parents 532e769 + 6faaccf
@mollybostic mollybostic authored
Showing with 495 additions and 337 deletions.
  1. +41 −35 ITPro/Services/hdinsight/blob-hive-sql.md
  2. +1 −1  ITPro/Services/hdinsight/upload-data.md
  3. +134 −68 ITPro/Services/hdinsight/using-blob-store.md
  4. +40 −26 ITPro/Services/hdinsight/using-hdinsight-sdk.md
  5. +63 −43 ITPro/Services/hdinsight/using-hive.md
  6. +164 −118 ITPro/Services/hdinsight/using-mapreduce.md
  7. +52 −46 ITPro/Services/hdinsight/using-pig.md
  8. BIN  ITPro/Services/media/ASV-files.png
  9. BIN  ITPro/Services/media/HDI.ASEBlob.png
  10. BIN  ITPro/Services/media/HDI.ASEUploadFile.png
  11. BIN  ITPro/Services/media/HDI.ASVSample.PNG
  12. BIN  ITPro/Services/media/HDI.ClusterSummary.png
  13. BIN  ITPro/Services/media/HDI.CustomCreateStorageAccount.png
  14. BIN  ITPro/Services/media/HDI.Dashboard1.png
  15. BIN  ITPro/Services/media/HDI.HadoopCommandLine.png
  16. BIN  ITPro/Services/media/HDI.HiveConsole.png
  17. BIN  ITPro/Services/media/HDI.IJCListFile.png
  18. BIN  ITPro/Services/media/HDI.InteractiveJavaScriptConsole.png
  19. BIN  ITPro/Services/media/HDI.JobHistoryPage.png
  20. BIN  ITPro/Services/media/HDI.MapReduceResults.png
  21. BIN  ITPro/Services/media/HDI.MonitorPage.png
  22. BIN  ITPro/Services/media/HDI.QuickCreate.png
  23. BIN  ITPro/Services/media/HDI.fsput.png
76 ITPro/Services/hdinsight/blob-hive-sql.md
@@ -4,20 +4,21 @@
#Using HDInsight to Process Blob Storage Data and Write the Results to a SQL Database
-This tutorial will show you how to use the Windows Azure HDInsight Service to process data stored in Windows Azure Blob Storage and move the results to a Windows Azure SQL Database. To enable the HDInsight preview, click [here](https://account.windowsazure.com/PreviewFeatures). For more information on Using Windows Azure Blob storage, see [Using Windows Azure Blob Storage with HDInsight](/en-us/manage/services/hdinsight/using-blob-store/)
+Hive provides a means of running MapReduce jobs through a SQL-like scripting language called *HiveQL*, which can be applied towards summarization, querying, and analysis of large volumes of data. This tutorial will show you how to use HiveQL to process data stored in Windows Azure Blob Storage and move the results to a Windows Azure SQL Database.
**Estimated time to complete:** 30 minutes
##In this article:
-* Download the test data
-* Upload data to Windows Azure Blob Storage
-* Connect to the Hive console
-* Create a Hive table and populate data
-* Execute a HiveQL Query
-* Export data from HDFS to Windows Azure SQL Database
+* [Download the test data](#downloaddata)
+* [Upload data to Windows Azure Blob Storage](#uploaddata)
+* [Connect to the Hive console](#connect)
+* [Create a Hive table and populate data](#createtable)
+* [Execute a HiveQL Query](#executequery)
+* [Export data from HDFS to Windows Azure SQL Database](#exportdata)
+* [Next Steps](#nextsteps)
-## Download the Test Data
+##<a id="downloaddata"></a>Download the Test Data
In this tutorial, you will use the on-time performance of airline flights data from [Research and Innovative Technology Administration, Bureau of Transportation Statistics][rita-website] (RITA).
1. Browse to [Research and Innovative Technology Administration, Bureau of Transportation Statistics][rita-website] (RITA).
@@ -32,41 +33,39 @@ In this tutorial, you will use the on-time performance of airline flights data f
3. Click **Download**. Each file could take up to 15 minutes to download.
4. Unzip the file to the **C:\Tutorials** folder. Each file is a CSV file and is approximately 60 GB in size.
-5. Rename the file to the name of the month that it contains data for. For example, the file containing the January data would be named January.csv.
+5. Rename the file to the name of the month that it contains data for. For example, the file containing the January data would be named *January.csv*.
6. Repeat steps 2 through 5 to download a file for each of the 12 months in 2012.
-##Upload Data to Windows Azure Blob Storage
-
-You must have a [Windows Azure subscription][free-trial], and a [Windows Azure Storage Account][create-storage-account] before you can proceed. You must also know your Windows Azure storage account name and account key. For the instructions for get the information, see the *How to: View, copy and regenerate storage access keys* section of [How to Manage Storage Accounts](/en-us/manage/services/storage/how-to-manage-a-storage-account/).
+##<a id="uploaddata"></a>Upload Data to Windows Azure Blob Storage
+HDInsight provides two options for storing data, Windows Azure Blob Storage and Hadoop Distributed File System (HDFS). For more information on choosing file storage, see [Using Windows Azure Blob Storage with HDInsight](/en-us/manage/services/hdinsight/using-blob-store). When you provision an HDInsight cluster, the provision process creates a Windows Azure Blob storage container as the default HDInsight file system. To simplify the tutorial procedures, you will use this container for storing the flight data files.
*Azure Storage Explorer* is a useful tool for inspecting and altering the data in your Windows Azure Storage. It is a free tool that can be downloaded from [http://azurestorageexplorer.codeplex.com/](http://azurestorageexplorer.codeplex.com/ "Azure Storage Explorer").
-2. Run **Azure Storage Explorer**.
+Before using the tool, you must know your Windows Azure storage account name and account key. For instructions on creating a Windows Azure Storage account, see [How To Create a Storage Account](/en-us/manage/services/storage/how-to-create-a-storage-account/). For instructions on how to get the account name and key, see the *How to: View, copy and regenerate storage access keys* section of [How to Manage Storage Accounts](/en-us/manage/services/storage/how-to-manage-a-storage-account/).
+
+1. Run **Azure Storage Explorer**.
![HDI.AzureStorageExplorer](../media/HDI.AzureStorageExplorer.png "Azure Storage Explorer")
-3. Click **Add Account** if the Windows Azure storage account has not been added to the tool.
+2. Click **Add Account** if the Windows Azure storage account has not been added to the tool.
![HDI.ASEAddAccount](../media/HDI.ASEAddAccount.png "Add Account")
-4. Enter **Storage account name** and **Storage account key**, and then click **Add Storage Account**.
-
-5. From **Storage Type**, click **Blobs** to display the Windows Azure Blob storage of the account.
-
- ![HDI.ASEBlob](../media/HDI.ASEUploadFile.png "Azure Storage Explorer")
+3. Enter **Storage account name** and **Storage account key**, and then click **Add Storage Account**.
+4. From **Storage Type**, click **Blobs** to display the Windows Azure Blob storage of the account.
+5. From **Container**, click **New** to create a new container for the flight on-time data.
+6. Enter **flightinfo** as the container name, and then click **Create Container**.
+7. Click the **flightinfo** container to select it.
+8. From **Blob**, click **Upload**.
+9. Select the 12 files and then click **Open**.
+10. Select **January.csv**, and then click **Rename**.
+11. Prefix the name with **delays/**. When you are finished, you should have file names that look like this:
-6. From **Container**, click **New** to create a new container for the flight on-time data.
-7. Enter **flightinfo** as the container name, and then click **Create Container**.
-8. Click the **flightinfo** container to select it.
-9. From **Blob**, click **Upload**.
-10. Select the 12 files and then click **Open**.
-11. Select **January.csv**, and then click **Rename**.
-12. Prefix the name with **delays/**. When you are finished, you should have file names that look like this:
+ ![ASV files](../media/ASV-files.png "ASV files")
- ![ASV files][asv-files]
-
-## Connect to the Hive Console
+##<a id="connect"></a> Connect to the Hive Console
+You must have an HDInsight cluster provisioned before you can work on this tutorial. To enable the Windows Azure HDInsight Service preview, click [here](https://account.windowsazure.com/PreviewFeatures). For information on provisioning an HDInsight cluster, see [How to Administer HDInsight Service](/en-us/manage/services/hdinsight/howto-administer-hdinsight/) or [Getting Started with Windows Azure HDInsight Service](/en-us/manage/services/hdinsight/get-started-hdinsight/).
In this tutorial, you will use the Hive console to run the Hive queries. The other option is the Hadoop Command Line, accessed through remote desktop.
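Hive queries can also be run from the Hadoop Command Line by using the Hive command-line client. As a rough illustration (this is not one of the tutorial's steps, and depending on the cluster image you may need to run the client from the Hive bin directory), the following command lists the tables defined on the cluster:

    hive -e "SHOW TABLES;"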
@@ -79,12 +78,18 @@ In this tutorial, you will use the Hive console to run the Hive queries. The ot
![HDI.TileInteractiveConsole](../media/HDI.TileInteractiveConsole.png "Interactive Console")
-7. Click **Hive** on the upper right corner.
+7. Click **JavaScript** on the upper right corner.
+8. Replace **StorageAccountName** in the following command with your storage account name, and then run the command:
+
+ #ls asv://flightinfo@StorageAccountName.blob.core.windows.net/delays
+
+    You will get the list of files you uploaded using Azure Storage Explorer.
-##Create a Hive Table and Populate Data
+##<a id="createtable"></a>Create a Hive Table and Populate Data
The next step is to create a Hive table from the data in Azure Storage Vault (ASV)/Blob storage.
-1. Replace **storageaccountname** in the following query with your Windows Azure Storage account name, and then copy and paste the following code to the query pane:
+1. From the Interactive console, click **Hive** on the upper right corner.
+2. Replace **storageaccountname** in the following query with your Windows Azure Storage account name, and then copy and paste the following code to the query pane:
create external table delays_raw (
YEAR string,
@@ -189,7 +194,7 @@ The next step is to create a Hive table from the data in Azure Storage Vault (AS
OK
Time taken: 139.283 seconds
-##Execute a HiveQL Query
+##<a id="executequery"></a>Execute a HiveQL Query
After the *delays* table has been created, you are now ready to run queries against it.
1. Replace **username** in the following query with the username you used to log into the cluster, and then copy and paste the following query into the query pane:
@@ -235,7 +240,7 @@ After the *delays* table has been created, you are now ready to run queries agai
js> #cat asv:///user/username/queryoutput/000000_0
-##Export Data from HDFS to Windows Azure SQL Database
+##<a id="exportdata"></a>Export Data from HDFS to Windows Azure SQL Database
Before copying data from HDFS to a Windows Azure SQL Database, the SQL Database must exist. To create a database, follow the instructions here: [Getting started with Windows Azure SQL Database](http://www.windowsazure.com/en-us/manage/services/sql-databases/getting-started-w-sql-databases/). Note that your table schema must match that of the data in HDFS and it must have a clustered index. To use the command below, create a database called **MyDatabase** and a table called **AvgDelays** with the following schema:
@@ -315,9 +320,10 @@ Before copying data from HDFS to a Windows Azure SQL Database, the SQL Database
![SQL results][sql-results]
-## Next Steps
+##<a id="nextsteps"></a> Next Steps
Now you understand how to upload files to Blob storage, how to populate a Hive table with the data from Blob storage, how to run Hive queries, and how to use Sqoop to export data from HDFS to a Windows Azure SQL Database. To learn more, see the following articles:
+* [Getting Started with Windows Azure HDInsight Service](/en-us/manage/services/hdinsight/get-started-hdinsight/)
* [Tutorial: Using MapReduce with HDInsight](/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/)
* [Tutorial: Using Hive with HDInsight](/en-us/manage/services/hdinsight/using-hive-with-hdinsight/)
* [Tutorial: Using Pig with HDInsight](/en-us/manage/services/hdinsight/using-pig-with-hdinsight/)
2  ITPro/Services/hdinsight/upload-data.md
@@ -52,7 +52,7 @@ Before using the tool, you must know your Windows Azure storage account name and
Data stored in Windows Azure Blob Storage can be accessed directly from the Interactive JavaScript Console by prefixing the protocol scheme of the URI for the assets you are accessing with asv://. To secure the connection, use asvs://. The scheme for accessing data in Windows Azure Blob Storage is:
- asvs://container/path.
+    asvs://[<container>@]<accountname>.blob.core.windows.net/<path>
The following is an example of viewing data stored in Windows Azure Blob Storage using the Interactive Javascript Console:
202 ITPro/Services/hdinsight/using-blob-store.md
@@ -2,36 +2,79 @@
<div chunk="../chunks/hdinsight-left-nav.md" />
-#Using Windows Azure Blob Storage with HDInsight #
+#Using Windows Azure Blob Storage with HDInsight
-Windows Azure HDInsight Service supports both Hadoop Distributed Files System (HDFS) and Windows Azure Blob Storage for storing data. Blob Storage is a robust, general purpose Windows Azure storage solution. An HDFS file system over Blob Storage is referred as Azure Storage Vault or ASV for short. ASV provides a full featured HDFS file system interface for Blob Storage that provides a seamless experience to customers by enabling the full set of components in the Hadoop ecosystem to operate (by default) directly on the data managed by Blog Storage. Blob Storage is not just a low cost solution. Storing data in Blob Storage enables the HDInsight clusters used for computation to be safely deleted without losing user data.
+Windows Azure HDInsight Service supports both the Hadoop Distributed File System (HDFS) and Azure Storage Vault (ASV) for storing data. Windows Azure Blob Storage is a robust, general-purpose Windows Azure storage solution. ASV provides a full-featured HDFS file system interface for Blob Storage, giving customers a seamless experience by enabling the full set of components in the Hadoop ecosystem to operate (by default) directly on the data managed by Blob Storage. Blob Storage is not just a low-cost solution; storing data in Blob Storage also enables the HDInsight clusters used for computation to be safely deleted without losing user data.
-**Note:** Most HDFS commands such as ls, copyFromLocal, mkdir etc. will still work as expected. Only the commands that are specific to the native HDFS implementation (which is referred to as DFS) such as fschk and dfsadmin will show different behavior.
+<div class="dev-callout"> 
+<b>Note</b> 
+<p>Most HDFS commands such as ls, copyFromLocal, and mkdir will still work as expected. Only the commands that are specific to the native HDFS implementation (which is referred to as DFS), such as fsck and dfsadmin, will show different behavior on ASV.</p>
+</div>
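+For example, everyday file system operations look the same against ASV as they do against native HDFS. The following commands (the paths are placeholders used only for illustration) list a folder on the default ASV file system and copy a local file into it:
+
+    hadoop fs -ls asv:///example/data
+    hadoop fs -copyFromLocal log1.txt asv:///example/data/log1.txt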
-In this Article
+To enable the Windows Azure HDInsight Service preview, click [here](https://account.windowsazure.com/PreviewFeatures).
- • The HDInsight service storage architecture
- • Provision the default file system
- • Accessing files in other blob storage
- • Addressing files in blob storage
- • Mapping an ASU URI to a Blob Storage URI
- • Mapping Files and Directories into Blob Storage Containers
- • Conclusion
- • Next Steps
+##In this Article
-##The HDInsight Service Storage Architecture
+* [The HDInsight service storage architecture](#architecture)
+* [Benefits of ASV](#benefits)
+* [Preparing Blob Storage Container for ASV](#preparingblobstorage)
+* [Addressing files in blob storage](#addressing)
+* [Next Steps](#nextsteps)
+
+##<a id="architecture"></a>The HDInsight Service Storage Architecture
The following diagram provides an abstract view of the HDInsight Service's storage architecture:
![HDI.ASVArch](../Media/HDI.ASVArch.gif "HDInsight Storage Architecture")
-The HDInsight Service provides access to the distributed file system that is locally attached to the compute nodes. This file system can be accessed using the fully qualified URI. For example: hdfs://&lt;namenodehost&gt;/&lt;path&gt;.
+The HDInsight Service provides access to the distributed file system that is locally attached to the compute nodes. This file system can be accessed using the fully qualified URI. For example:
+
+ hdfs://<namenodehost>/<path>
+
+In addition, HDInsight Service provides the ability to access data stored in Blob Storage containers. The syntax to access ASV is:
+
+    asv[s]://[<container>@]<accountname>.blob.core.windows.net/<path>
+
+
+Hadoop supports a notion of a default file system. The default file system implies a default scheme and authority; it can also be used to resolve relative paths. During the HDInsight provision process, you must specify a Blob Storage account and a container to use as the default file system.
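+For illustration (the account and container names below are placeholders), when the container *mycontainer* on the storage account *myaccount* is the default file system, the following three commands all refer to the same folder:
+
+    hadoop fs -ls asv://mycontainer@myaccount.blob.core.windows.net/example/data
+    hadoop fs -ls asv:///example/data
+    hadoop fs -ls /example/data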
+
+
+Other than the Blob Storage container designated as the default file system, you can also access containers that reside in the same Windows Azure storage account or different Windows Azure storage accounts:
+
+* **Container in the same storage account:** Because the account name and key are stored in the core-site.xml, you have full access to the files in the container.
+* **Container in a different storage account with the *public container* or the *public blob* access level:** You have read-only permission to the files in the container.
+
+ <div class="dev-callout"> 
+ <b>Note</b> 
+  <p>Public Container allows you to get a list of all blobs available in that container and get container metadata. Public Blob allows you to access the blobs only if you know the exact URL. For more information, see [Restrict Access to Containers and Blobs](http://msdn.microsoft.com/en-us/library/windowsazure/dd179354.aspx).</p>
+ </div>
+
+* **Container in a different storage account with the *private* access level:** You must add a new entry for each such storage account to the C:\apps\dist\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml file to be able to access the files in the container from HDInsight:
+
+ <property>
+        <name>fs.azure.account.key.<accountname>.blob.core.windows.net</name>
+ <value><enterthekeyvaluehere></value>
+ </property>
+
+ Note there is already an entry in the configuration file for the Blob Storage container used as the default file system.
+
+Be aware that accessing such a container may take you outside of your subscription's data center, which may incur additional charges for data flowing across data center boundaries. Such access should also always be encrypted by using the ASVS URI scheme.
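+As an illustration (the container and account names here are placeholders), a folder in a container that belongs to a different storage account can be listed over SSL like this:
+
+    hadoop fs -ls asvs://dailylogs@otheraccount.blob.core.windows.net/input/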
+
+
+Blob storage containers store data as key/value pairs, and there is no directory hierarchy. However, the '/' character can be used within the key name to make it appear as if a file is stored within a directory structure. For example, a blob's key may be 'input/log1.txt'. No actual 'input' directory exists, but due to the presence of the '/' character in the key name, it has the appearance of a file path.
-In addition, HDInsight Service provides the ability to access data stored in Blob Storage containers, one of which is designated as the default file system during the provision process.
-##Benefits of ASV
+
+
+
+
+
+
+
+
+##<a id="benefits"></a>Benefits of ASV
The implied performance cost of not having compute and storage co-located is mitigated by the way the compute clusters are provisioned close to the storage account resources inside the Windows Azure data center, where the high-speed network makes it very efficient for the compute nodes to access the data inside Blob Storage. Depending on general load and on compute and access patterns, only slight performance degradation has been observed, and access is often even faster.
There are several benefits associated with storing the data in Blob Storage instead of HDFS:
@@ -44,10 +87,18 @@ There are several benefits associated with storing the data in Blob Storage inst
Certain Map-Reduce jobs and packages may create intermediate results that you don't really want to store in the Blob Storage container. In that case, you can still elect to store the data in the local HDFS file system. In fact, HDInsight uses DFS for several of these intermediate results in Hive jobs and other processes.
-##Provision the Default File System
-Hadoop supports a notion of default file system. A user can set the default file system for his HDInsight cluster during the provision process. The default file system implies a default scheme and authority; it can also be used to resolve relative paths.
-When provisioning an HDInsight cluster from Windows Azure Management Portal, there are two options: quick create and custom create. Using either of the options, a Windows Azure Storage account must be created beforehand. For instructions, see How to [Create a Storage Account]( /en-us/manage/services/storage/how-to-create-a-storage-account/).
+
+
+
+##<a id="preparingblobstorage"></a>Preparing Blob Storage Container for ASV
+To use blobs, you first create a [Windows Azure storage account](/en-us/manage/services/storage/how-to-create-a-storage-account/). As part of this, you specify the Windows Azure datacenter that will store the objects you create using this account. Choosing the same datacenter as your HDInsight cluster can improve performance. Wherever it lives, each blob you create belongs to some container in your storage account. This container may be an arbitrary Blob Storage container created outside of HDInsight, or it may be a container that is created as an ASV file system from HDInsight.
+
+
+
+**Provision the Container Used as the Default File System**
+
+When provisioning an HDInsight cluster from the Windows Azure Management Portal, there are two options: *quick create* and *custom create*. With either option, a Windows Azure Storage account must be created beforehand. For instructions, see [How to Create a Storage Account](/en-us/manage/services/storage/how-to-create-a-storage-account/).
Using the quick create option, you can choose an existing storage account. The provision process will create a new container with the same name as the HDInsight cluster name. This container will be used as the default file system.
@@ -57,107 +108,122 @@ Using the custom create, you can either choose an existing Blob Storage containe
![HDI.CustomCreateStorageAccount](../Media/HDI.CustomCreateStorageAccount.png "Custom Create Storage Account")
-The provision process adds an entry to the C:\apps\dist\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml file:
- <property>
- <name>fs.azure.account.key.<accountname></name>
- <value><enterthekeyvaluehere></value>
- </property>
-
-Once a Blob Storage container has designated as the default file system for the HDInsight Service, it cannot be changed to a different container.
-##Accessing Files in other Blob Storage
-Other than the Blob Storage container designated as the default file system, you can also access containers that reside in the same Windows Azure storage account or different Windows Azure storage accounts:
-* Container in the same storage account: Because the account name and key are stored in the core-site.xml, you have full access to the files in the container.
-* Container in a different storage account with the public container access level: you have read only permission to the files in the container.
-* Container in a different storage account with the private or public blob access levels: you must add a new entry to the core-site.xml files to be able to access the files in the container:
- <property>
- <name>fs.azure.account.key.<accountname></name>
- <value><enterthekeyvaluehere></value>
- </property>
-
-Be aware that accessing such a container may go outside of your subscription's data center, this may incur additional charges for data flowing across the data center boundaries. It should also always be encrypted using the ASVS URI scheme.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+**Using APIs to Create Containers**
Creating an ASV File System can be done by creating a new Blob Storage container through the commonly-used APIs in a storage account for which core-site.xml contains the storage key. In addition, you can also create a new container by referring to it in an HDFS file system command. For example:
- hadoop fs -mkdir asvs://newcontainer@myaccount/newdirectory
+ hadoop fs -mkdir asvs://<newcontainer>@<accountname>.blob.core.windows.net/<newdirectory>
+
+The example command will not only create the new directory *newdirectory* but, if it doesn't exist, will also create a new container called *newcontainer*.
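+You can verify the result with an ordinary listing command (using the same placeholder names):
+
+    hadoop fs -ls asvs://<newcontainer>@<accountname>.blob.core.windows.net/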
+
+
+
+
+
+
-The example command will not only create the new directory newdirectory but, if it doesn't exist, will also create a new container called newcontainer.
-This container may be a container that has been created as an ASV file system, or it may be just an arbitrary Blob Storage container created outside of HDInsight. In either case you can "mount" it and use it as the default file system. Note that the data will stay in that container and will not be copied to a different location.
-##Addressing Files in Blob Storage
-The URI scheme ([] indicates that the part is optional, <> enclose concepts) for accessing files in Blob Storage is:
+##<a id="addressing"></a>Addressing Files in Blob Storage
- asv[s]://[[<container>@]<domain>]/<path>
+The URI scheme for accessing files in Blob Storage is:
-The URI scheme provides both unencrypted access with the ASV: prefix, and SSL encrypted access with ASVS:. We recommend using ASVS: wherever possible, even when accessing data that lives inside the same Windows Azure data center.
+ asv[s]://[<container>@]<accountname>.blob.core.windows.net/<path>
+
+The URI scheme provides both unencrypted access with the ASV: prefix and SSL encrypted access with the ASVS: prefix. We recommend using ASVS: wherever possible, even when accessing data that lives inside the same Windows Azure data center.
The &lt;container&gt; identifies the name of the Blob Storage container. If no container name is specified but the domain is, then it refers to the [root container](http://msdn.microsoft.com/en-us/library/windowsazure/ee395424.aspx) of the domain's storage account. Note that root containers are read-only.
-The &lt;domain&gt; identifies the storage account domain. If it does not contain a dot (.), it will be interpreted as &lt;domain&gt;.blob.core.windows.net.
+The &lt;accountname&gt; identifies the storage account. The fully qualified domain name (FQDN) is required.
-If neither the container nor the domain has been specified, then the default file system is used.
+If neither the container nor the accountname has been specified, then the default file system is used.
The &lt;path&gt; is the file or directory HDFS path name. Since Blob Storage containers are just a key-value store, there is no true hierarchical file system. A / inside an Azure Blob's key is interpreted as a directory separator. Thus, if a blob's key is input/log1.txt, then it is the file log1.txt inside the directory input.
For example:
- asvs://dailylogs@myaccount/input/log1.txt
+ asvs://dailylogs@myaccount.blob.core.windows.net/input/log1.txt
refers to the file log1.txt in the directory input on the Blob Storage container dailylogs at the location myaccount.blob.core.windows.net using SSL.
- asvs://myaccount/result.txt
+ asvs://myaccount.blob.core.windows.net/result.txt
-refers to the file result.txt on the read-only ASV file system in the root container at the location myaccount.blob.core.windows.net that gets accessed through SSL. Note that asv://myaccount/output/result.txt will result in an exception, because Blob Storage does not allow / inside path names in the root container to avoid ambiguities between paths and folder names.
+refers to the file result.txt on the read-only ASV file system in the root container at the location myaccount.blob.core.windows.net that gets accessed through SSL. Note that asv://myaccount.blob.core.windows.net/output/result.txt will result in an exception, because Blob Storage does not allow / inside path names in the root container to avoid ambiguities between paths and folder names.
asv:///output/result.txt
refers to the file result.txt in the output directory on the default file system.
+You must specify the FQDN when using SSL. The following command will return an error:
+
+ asvs:///output/result.txt
+
+Instead, you must use the following command:
+
+ asvs://dailylogs@myaccount.blob.core.windows.net/output/result.txt
+
Because HDInsight uses a Blob Storage container as the default file system, you can refer to files and directories inside the default file system using relative or absolute paths. For example, the following statement refers to a file on the default file system by its absolute path:
- hadoop fs -ls /
+ hadoop fs -ls /output/result.txt
-##Mapping an ASV URI to a Blob Storage URI
+**Mapping a Blob Storage URI to an ASV URI**
+
+Given a Blob Storage URI, you may need to be able to construct the corresponding ASV URI. The mapping is straightforward.
-Given an ASV URI, you may need to be able to create the Blob Storage URI which can be used to access the blob directly in Blob Storage. The mapping is straight forward.
To access a file (or folder), map the Blob Storage URI to the corresponding ASV URI as follows:
-<table>
+<table border=1>
<tr><th>Blob Storage URI</th><th>ASV URI</th></tr>
-<tr><td>asv[s]://&lt;account&gt;/&lt;path-name&gt;</td><td>http[s]://&lt;account&gt;/&lt;path-name&gt;</td></tr>
-<tr><td>asv[s]://&lt;container&gt;@&lt;account&gt;/&lt;path-name&gt;</td><td>http[s]://&lt;account&gt;/&lt;container&gt;/&lt;path-name&gt;</td></tr>
-<tr><td>asv[s]:///&lt;path-name&gt;</td><td>http[s]://&lt;account&gt;/&lt;container&gt;/&lt;path-name&gt;<br/>
-
-where account and container are the values used for specifying the default file system.</td></tr>
-<tr><td>asvs://dailylogs@myaccount/input/log1.txt</td><td>https://myaccount.blob.core.windows.net/dailylogs/input/log1.txt</td></tr>
-</table>
+<tr><td>http[s]://&lt;account&gt;.blob.core.windows.net/&lt;path-name&gt;</td><td>asv[s]://&lt;account&gt;.blob.core.windows.net/&lt;path-name&gt;</td></tr>
-##Mapping Files and Directories into Blob Storage Containers
+<tr><td>http[s]://&lt;account&gt;.blob.core.windows.net/&lt;container&gt;/&lt;path-name&gt;</td><td>asv[s]://&lt;container&gt;@&lt;account&gt;.blob.core.windows.net/&lt;path-name&gt;</td></tr>
-Blob Storage containers are a key-value store. There is no true hierarchical file system. A / inside an Azure Blob's key is interpreted as a directory separator. Each segment in the blob's key separated by a directory separator implies a directory, or, in the case of the last segment, the file name. For example a blob's key input/log1.txt is the file log1.txt inside the directory input.
+<tr><td>http[s]://&lt;account&gt;.blob.core.windows.net/&lt;container&gt;/&lt;path-name&gt;<br/>
-This is also how to map the HDFS file and directory structure back into the Blob Storage container. A file f.txt inside the directories a/b/c will be stored as blob called a/b/c/f.txt inside the Blob Storage container.
+where account and container are the values used for specifying the default file system.</td><td>asv:///&lt;path-name&gt;<br/> asvs://&lt;container&gt;@&lt;account&gt;.blob.core.windows.net/&lt;path-name&gt;</td></tr>
-In order to preserve the POSIX-based semantics of HDFS, you need to add some more information to preserve the presence of folders that were created either explicitly through mkdir or implicitly by creating files inside them. This is achieved by creating a place holder blob for the directory, which is empty, and has two metadata properties that indicate that the blob is an ASV directory (asv_isfolder) and what its permissions are (asv_permissions).
+<tr><td>https://&lt;account&gt;.blob.core.windows.net/dailylogs/input/log1.txt</td><td>asvs://dailylogs@&lt;account&gt;.blob.core.windows.net/input/log1.txt</td></tr>
-Such folder blobs may also be created in a normal Blob Storage container, if you perform a writing/updating HDFS file command on it such as deleting a file inside a folder, since is it necessary to preserve the semantics that deleting a file will not delete the containing folder.
+</table>
-##Conclusion
+##<a id="nextsteps"></a>Next Steps
In this article, you learned how to use Blob Storage with HDInsight and that Blob Storage is a fundamental component of the HDInsight Service. This will allow you to build scalable, long-term data acquisition and archiving solutions with Windows Azure Blob Storage and use HDInsight to unlock the information inside the stored data.
-##Next Steps
-
Now you understand how to use Windows Azure Blob Storage. To learn more, see the following articles:
+* [Getting Started with Windows Azure HDInsight Service](/en-us/manage/services/hdinsight/get-started-hdinsight/)
* [How to: Upload data][upload-data]
* [Using Pig with HDInsight][pig]
* [Using Hive with HDInsight][hive]
66 ITPro/Services/hdinsight/using-hdinsight-sdk.md
@@ -4,45 +4,47 @@
#Using the HDInsight Service Client Library#
-The HDInsight Service library provides set of .NET client libraries that makes it easier to work with Hadoop in .NET. In this tutorial you will learn how to get the client library and use it to build simple .NET based Hadoop program. To enable the HDInsight preview, click [here](https://account.windowsazure.com/PreviewFeatures).
+The HDInsight Service library provides a set of .NET client libraries that make it easier to work with HDInsight from .NET. In this tutorial, you will learn how to get the client libraries and use them to build a simple .NET-based Hadoop program that runs Hive queries.
+
+To enable the HDInsight preview, click [here](https://account.windowsazure.com/PreviewFeatures).
## In this Article
-* Downloading an installing the library
-* Preparing for the tutorial
-* Executing hive jobs on HDInsight custer from a .NET program
-* Next Steps
+* [Downloading and installing the library](#install)
+* [Preparing for the tutorial](#prepare)
+* [Creating and running a .NET program](#create)
+* [Next Steps](#nextsteps)
-## Downloading and Installing the Library##
+##<a id="install"></a> Downloading and Installing the Library##
-You can install latest published build of the library from NuGet. The library includes following components:
+You can install the latest published build of the library from [NuGet](http://nuget.codeplex.com/wikipage?title=Getting%20Started). The library includes the following components:
-* A Map/Reduce library - this library simplifies writing map/reduce jobs in .NET languages using the Hadoop streaming interface
-* LINQ to Hive client library – this library translates C# or F# LINQ queries into Hive queries and executes them on the Hadoop cluster. This library can execute arbitrary Hive HQL queries from a .NET program as well.
-* WebClient librarycontains client libraries for WebHDFS and WebHCat
+* **MapReduce library:** This library simplifies writing MapReduce jobs in .NET languages using the Hadoop streaming interface.
+* **LINQ to Hive client library:** This library translates C# or F# LINQ queries into HiveQL queries and executes them on the Hadoop cluster. This library can execute arbitrary HiveQL queries from a .NET program as well.
+* **WebClient library:** This library contains client libraries for *WebHDFS* and *WebHCat*.
- * WebHDFS client library – works with files in HDFS and Blog Storage
- * WebHCat client library – manages scheduling and execution of jobs in Hadoop cluster
+    * **WebHDFS client library:** It works with files in HDFS and Windows Azure Blob Storage.
+    * **WebHCat client library:** It manages scheduling and execution of jobs in an HDInsight cluster.
+
+The NuGet syntax to install the libraries:
-1. Open Visual Studio 2012.
-2. Create a new .NET project or open an existing .NET project.
-3. From the Tools menu, click **Library Package Manager**, click **Package Manager Console**.
-4. Run the following commands in the console to install the packages.
+ install-package Microsoft.Hadoop.MapReduce
+ install-package Microsoft.Hadoop.Hive
+ install-package Microsoft.Hadoop.WebClient
+
+These commands add .NET libraries and references to them to the current Visual Studio project.
- install-package Microsoft.Hadoop.MapReduce –pre
- install-package Microsoft.Hadoop.Hive -pre
- install-package Microsoft.Hadoop.WebClient -pre
+##<a id="prepare"></a> Preparing for the Tutorial
- These commands add .NET libraries and references to them to the current Visual Studio project.
+You must have a [Windows Azure subscription][free-trial] and a [Windows Azure Storage Account][create-storage-account] before you can proceed. You must also know your Windows Azure storage account name and account key. For instructions on how to get this information, see the *How to: View, copy and regenerate storage access keys* section of [How to Manage Storage Accounts](/en-us/manage/services/storage/how-to-manage-a-storage-account/).
-## Preparing for the Tutorial
-You must have a [Windows Azure subscription][free-trial], and a [Windows Azure Storage Account][create-storage-account] before you can proceed. You must also know your Windows Azure storage account name and account key. For the instructions for get the information, see the *How to: View, copy and regenerate storage access keys* section of [How to Manage Storage Accounts](/en-us/manage/services/storage/how-to-manage-a-storage-account/).
+You must also download the Actors.txt file used in this tutorial. Perform the following steps to download this file to your development environment:
1. Create a C:\Tutorials folder on your local computer.
2. Download [Actors.txt](http://www.microsoft.com/en-us/download/details.aspx?id=37003), and save the file to the C:\Tutorials folder.
-##Executing Hive Jobs on HDInsight Cluster from .NET Program
+##<a id="create"></a>Creating and Executing a .NET Program
In this section you will learn how to upload files to the Hadoop cluster programmatically and how to execute Hive jobs using LINQ to Hive.
@@ -58,7 +60,18 @@ In this section you will learn how to upload files to Hadoop cluster programmati
</table>
4. Click **OK** to create the project.
-5. Install the NuGet packages for Hive (Microsoft.Hadoop.Hive) and WebClient (Microsoft.Hadoop.WebClient) as described in the “Download and Install the Library” section.
+
+
+3. From the **Tools** menu, click **Library Package Manager**, and then click **Package Manager Console**.
+4. Run the following commands in the console to install the packages.
+
+ install-package Microsoft.Hadoop.Hive -pre
+ install-package Microsoft.Hadoop.WebClient -pre
+
+ These commands add .NET libraries and references to them to the current Visual Studio project.
+
+
+
6. From Solution Explorer, double-click **Program.cs** to open it.
7. Add the following using statements to the top of the file:
@@ -119,12 +132,13 @@ In this section you will learn how to upload files to Hadoop cluster programmati
9. Press **F5** to run the program.
-#Next Steps
+##<a id="nextsteps"></a>Next Steps
Now you understand how to create a .NET application using the HDInsight client SDK. To learn more, see the following articles:
+* [Getting Started with Windows Azure HDInsight Service](/en-us/manage/services/hdinsight/get-started-hdinsight/)
* [Using Pig with HDInsight][hdinsight-pig]
* [Using MapReduce with HDInsight][hdinsight-mapreduce]
-* [Using Hive](/en-us/manage/services/hdinsight/using-hive/)
+* [Using Hive with HDInsight](/en-us/manage/services/hdinsight/using-hive/)
[hdinsight-pig]: /en-us/manage/services/hdinsight/using-pig-with-hdinsight/
[hdinsight-mapreduce]: /en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
106 ITPro/Services/hdinsight/using-hive.md
@@ -4,19 +4,27 @@
# Using Hive with HDInsight #
-Hive provides a means of running MapReduce job through an SQL-like scripting language, called *HiveQL*, which can be applied towards summarization, querying, and analysis of large volumes of data. HiveQL enables anyone already familiar with SQL to query the data. HiveQL is intuitive and easy-to-understand, it does require learning a new scripting language.
+Hive provides a means of running MapReduce jobs through a SQL-like scripting language, called *HiveQL*, which can be applied towards summarization, querying, and analysis of large volumes of data. In this tutorial, you will use HiveQL to query the data in an Apache log4j log file and report basic statistics.
+
+
+[Apache Log4j](http://en.wikipedia.org/wiki/Log4j) is a logging utility. Each log entry in a log file contains a *log level* field that shows the type and severity. For example:
+
+ 2012-02-03 20:26:41 SampleClass3 [TRACE] verbose detail for id 1527353937
-In this tutorial, you will use HiveQL to query the data in an [Apache log4j](http://en.wikipedia.org/wiki/log4j) log file and report basic statistics. Later in the tutorial, you will see a sample of the log4j log file.
**Estimated time to complete:** 30 minutes
##In this Article
-* The Hive Usage case
-* Procedures
-* Next Steps
+* [The Hive Usage case](#usage)
+* [Upload a sample log4j file to Windows Azure Blob Storage](#uploaddata)
+* [Connect to the interactive console](#connect)
+* [Create a Hive table and upload data to the table](#createhivetable)
+* [Run Hive queries](#runhivequeries)
+* [Tutorial clean up](#cleanup)
+* [Next Steps](#nextsteps)
-## The Hive Usage Case ##
+##<a id="usage"></a>The Hive Usage Case
Databases are great for small sets of data and low latency queries. However, when it comes to Big Data and large data sets in terabytes, traditional SQL databases are not the ideal solution. Traditionally, database administrators have relied on scaling up by buying bigger hardware as database load increases and performance degrades.
@@ -32,25 +40,22 @@ Generally, all applications save errors, exceptions and other coded issues in a
Log files are therefore a good example of big data. Working with big data is difficult using relational databases and statistics/visualization packages. Due to the large amounts of data and the computation of this data, parallel software running on tens, hundreds, or even thousands of servers is often required to compute this data in a reasonable time. Hadoop provides a Hive data warehouse system that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.
-##Procedures
-You must have an HDInsight cluster previsioned before you can work on this tutorial. For information on prevision an HDInsight cluster see [How to Administer HDInsight Service](/en-us/manage/services/hdinsight/howto-administer-hdinsight) or [Getting Started with Windows Azure HDInsight Service](/en-us/manage/services/hdinsight/get-started-hdinsight).
-
-HDInsight provides two options for storing data, Windows Azure Blob Storage and Hadoop Distributed File system (HDFS). For more information on choosing file storage, see [Using Windows Azure Blob Storage with HDInsight](/en-us/manage/services/hdinsight/using-blob-store). When you provision an HDInsight cluster, the provision process creates a Windows Azure Blob storage container as the default HDInsight file system. To simplify the tutorial procedures, you will use this container for storing the log4j file.
-
In this tutorial, you will complete the following tasks:
-* Upload a sample log4j file to Windows Azure Blob Storage
-* Connect to the interactive console
-* Examine the log file in the JavaScript console
-* Create a Hive table and upload data to the table
-* Run Hive queries
-* Tutorial clean up
+1. Upload a sample log4j file to Windows Azure Blob Storage
+2. Connect to the interactive console
+3. Create a Hive table and upload data to the table
+4. Run Hive queries
+5. Tutorial clean up
+
+
+##<a id="uploaddata"></a>Upload a Sample Log4j File to Windows Azure Blob Storage
+HDInsight provides two options for storing data, Windows Azure Blob Storage and Hadoop Distributed File System (HDFS). For more information on choosing file storage, see [Using Windows Azure Blob Storage with HDInsight](/en-us/manage/services/hdinsight/using-blob-store). When you provision an HDInsight cluster, the provision process creates a Windows Azure Blob storage container as the default HDInsight file system. To simplify the tutorial procedures, you will use this container for storing the log4j file.
-###Upload a Sample Log4j File to Windows Azure Blob Storage
*Azure Storage Explorer* is a useful tool for inspecting and altering the data in your Windows Azure Storage. It is a free tool that can be downloaded from [http://azurestorageexplorer.codeplex.com/](http://azurestorageexplorer.codeplex.com/ "Azure Storage Explorer").
-Before using the tool, you must know your Windows Azure storage account name and account key. For the instructions for get the information, see the *How to: View, copy and regenerate storage access keys* section of [How to Manage Storage Accounts](/en-us/manage/services/storage/how-to-manage-a-storage-account/).
+Before using the tool, you must know your Windows Azure storage account name and account key. For instructions on creating a Windows Azure Storage account, see [How To Create a Storage Account](/en-us/manage/services/storage/how-to-create-a-storage-account/). For instructions on how to get the account name and key, see the *How to: View, copy and regenerate storage access keys* section of [How to Manage Storage Accounts](/en-us/manage/services/storage/how-to-manage-a-storage-account/).
1. Download [sample.log](http://go.microsoft.com/fwlink/?LinkID=37003 "Sample.log") to your local computer.
@@ -118,9 +123,9 @@ Before using the tool, you must know your Windows Azure storage account name and
12. Click **Close**.
13. From the **File** menu, click **Exit** to close Azure Storage Explorer.
-### Connect to the Interactive Console
+##<a id="connect"></a> Connect to the Interactive Console
-In this task, you will start hive, create an external table, and load sample.log data into the table.
+You must have an HDInsight cluster provisioned before you can work on this tutorial. To enable the Windows Azure HDInsight Service preview, click [here](https://account.windowsazure.com/PreviewFeatures). For information on provisioning an HDInsight cluster, see [How to Administer HDInsight Service](/en-us/manage/services/hdinsight/howto-administer-hdinsight/) or [Getting Started with Windows Azure HDInsight Service](/en-us/manage/services/hdinsight/get-started-hdinsight/).
1. Sign in to the [Management Portal](https://manage.windowsazure.com).
2. Click **HDINSIGHT**. You will see a list of deployed Hadoop clusters.
@@ -131,46 +136,62 @@ In this task, you will start hive, create an external table, and load sample.log
![HDI.TileInteractiveConsole](../media/HDI.TileInteractiveConsole.png "Interactive Console")
-### Examine the Log File in the JavaScript Console
-1. If you are not in the JavaScript console, click **JavaScript** on the upper right corner.
+1. Click **JavaScript** on the upper right corner to open the Interactive JavaScript console.
2. Run the following commands to list the files in the default file system and display the content of sample.log:
- #ls asv:///
- #cat asv:///sample.log
-
+ #ls asv:///
+ #cat asv:///sample.log
+
![HDI.JavaScriptConsole](../media/HDI.IJCListFile.png "Interactive JavaScript Console")
+    The asv:/// syntax lists files in the default file system. To access files in other containers, use the following syntax:
+
+        #ls asv[s]://[[<container>@]<storagename>.blob.core.windows.net]/<path>
+ For example, you can list the same file using the following command:
-### Create a Hive Table and Upload Data to the Table
+ #ls asv://container@storagename.blob.core.windows.net/sample.log
+
+    Replace *container* with the container name and *storagename* with the storage account name.
+
+    Because the file is located on the default file system, the same result can also be retrieved by using the following command:
+
+ #ls /sample.log
+
+    To use asvs, you must provide the FQDN. For example, to access sample.log on the default file system:
+
+        #ls asvs://container@storagename.blob.core.windows.net/sample.log
+
+
+##<a id="createhivetable"></a> Create a Hive Table and Upload Data to the Table
1. Click the **Hive** button on the upper right corner. The Hive console looks like:
![HDI.HiveConsole](../media/HDI.HiveConsole.png "Hive Console")
-1. In the Hive Query pane, enter the following Hive query to create a table named log4jlogs, and then click **Evaluate**.
+2. In the Hive Query pane, enter the following Hive query to create a table named log4jLogs, and then click **Evaluate**.
CREATE TABLE log4jLogs(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
- **Note:** The command is terminated by two single quotes with a space in between.
+    Note that the command is terminated by two single quotes with a space in between.
-3. Enter the following Hive query to load the sample.log data into the logs table you just created, replace **storageaccount** and **domainname**, and then click **Evaluate**.
+3. Enter the following Hive query to load the sample.log data into the log4jLogs table you just created. Replace **container** and **storagename** with your container name and storage account name, and then click **Evaluate**.
- LOAD DATA LOCAL INPATH 'asv://storageaccount@domainname.blob.core.windows.net/sample.log' OVERWRITE INTO TABLE log4jLogs;
-
-### Run Hive Queries
-1. Run the following query to return the count of lines in the data:
+ LOAD DATA INPATH 'asv://container@storagename.blob.core.windows.net/sample.log' OVERWRITE INTO TABLE log4jLogs;
+
+##<a id="runhivequeries"></a> Run Hive Queries
+1. Enter the following query, and then click **Evaluate**. The query returns the count of lines in the data:
SELECT COUNT(*) FROM log4jLogs
-2. Run the following query to return the count of errors from the structured data:
+2. Enter the following query, and then click **Evaluate**. The query returns the count of errors from the structured data:
SELECT t4 AS sev, COUNT(*) AS cnt FROM log4jLogs WHERE t4 = '[ERROR]' GROUP BY t4
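As a further illustration (this is not one of the tutorial's own steps), a query along the following lines returns the count for every log level rather than just [ERROR]:

    SELECT t4 AS loglevel, COUNT(*) AS cnt FROM log4jLogs GROUP BY t4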
-## Tutorial Clean Up ##
+##<a id="cleanup"></a> Tutorial Clean Up
The clean up task applies to this tutorial only; it is not necessarily performed in an actual deployment. In this task, you will delete the table and the data so that if you like, you can run the tutorial again.
-1. Delete the table logs:
+1. From the Hive console, delete the table log4jLogs:
drop table log4jLogs;
@@ -178,12 +199,11 @@ The clean up task applies to this tutorial only; it is not necessarily performed
Congratulations! You have successfully completed this tutorial.
-##Next Steps
+##<a id="nextsteps"></a>Next Steps
While Hive makes it easy to query data using a SQL-like query language, other languages available with the HDInsight Service provide complementary functionality such as data movement and transformation. To learn more, see the following articles:
-* [Using Pig with HDInsight][hdinsight-pig]
-* [Using MapReduce with HDInsight][hdinsight-mapreduce]
-
-[hdinsight-pig]: /en-us/manage/services/hdinsight/using-pig-with-hdinsight/
-[hdinsight-mapreduce]: /en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
+* [Getting Started with Windows Azure HDInsight Service](/en-us/manage/services/hdinsight/get-started-hdinsight/)
+* [Using MapReduce with HDInsight](/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/)
+* [Using Pig with HDInsight](/en-us/manage/services/hdinsight/using-pig-with-hdinsight/) 
+* [How to Run the HDInsight Samples](/en-us/manage/services/hdinsight/howto-run-samples/)
282 ITPro/Services/hdinsight/using-mapreduce.md
@@ -4,48 +4,13 @@
# Using MapReduce with HDInsight#
-In this tutorial, you will execute a Hadoop MapReduce job to process a semi-structured Apache *log4j* log file on a Windows Azure HDInsight cluster. [Apache Log4j](http://en.wikipedia.org/wiki/Log4j) is a Java-based logging utility. Each log inside a file contains a log level field to show the log level type and the severity. This MapReduce job takes a log4j log file as input, and generates an output file that contains the log level along with its frequency count.
-
-**Estimated time to complete:** 30 minutes
-
-## In this Article
-* Big Data and Hadoop MapReduce
-* Procedures
-* Next Steps
-
-## Big Data and Hadoop MapReduce
-Generally, all applications save errors, exceptions and other coded issues in a log file. These log files can get quite large in size, containing a wealth of data that must be processed and mined. Log files are a good example of big data. Working with big data is difficult using relational databases with statistics and visualization packages. Due to the large amounts of data and the computation of this data, parallel software running on tens, hundreds, or even thousands of servers is often required to compute this data in a reasonable time. Hadoop provides a MapReduce framework for writing applications that process large amounts of structured and semi-structured data in parallel across large clusters of machines in a very reliable and fault-tolerant manner.
-
-The following figure is the visual representation of what you will accomplish in this tutorial:
-
-![HDI.VisualObjective](../media/HDI.VisualObject.gif "Visual Objective")
-
-
-The input data consists of a semi-structured log4j file:
+Hadoop MapReduce is a software framework for writing applications that process vast amounts of data. In this tutorial, you will create a Hadoop MapReduce job in Java and execute the job on a Windows Azure HDInsight cluster to process a semi-structured Apache *log4j* log file stored in Azure Storage Vault (ASV), which provides a full-featured HDFS file system over Windows Azure Blob storage.
+[Apache Log4j](http://en.wikipedia.org/wiki/Log4j) is a logging utility. Each log entry in a log file contains a *log level* field that shows the type and severity. For example:
2012-02-03 20:26:41 SampleClass3 [TRACE] verbose detail for id 1527353937
- java.lang.Exception: 2012-02-03 20:26:41 SampleClass9 [ERROR] incorrect format for id 324411615
- at com.osa.mocklogger.MockLogger#2.run(MockLogger.java:83)
- 2012-02-03 20:26:41 SampleClass2 [TRACE] verbose detail for id 191364434
- 2012-02-03 20:26:41 SampleClass1 [DEBUG] detail for id 903114158
- 2012-02-03 20:26:41 SampleClass8 [TRACE] verbose detail for id 1331132178
- 2012-02-03 20:26:41 SampleClass8 [INFO] everything normal for id 1490351510
- 2012-02-03 20:32:47 SampleClass8 [TRACE] verbose detail for id 1700820764
- 2012-02-03 20:32:47 SampleClass2 [DEBUG] detail for id 364472047
- 2012-02-03 20:32:47 SampleClass7 [TRACE] verbose detail for id 1006511432
- 2012-02-03 20:32:47 SampleClass4 [TRACE] verbose detail for id 1252673849
- 2012-02-03 20:32:47 SampleClass0 [DEBUG] detail for id 881008264
- 2012-02-03 20:32:47 SampleClass0 [TRACE] verbose detail for id 1104034268
- 2012-02-03 20:32:47 SampleClass6 [TRACE] verbose detail for id 1527612691
- java.lang.Exception: 2012-02-03 20:32:47 SampleClass7 [WARN] problem finding id 484546105
- at com.osa.mocklogger.MockLogger#2.run(MockLogger.java:83)
- 2012-02-03 20:32:47 SampleClass0 [DEBUG] detail for id 2521054
- 2012-02-03 21:05:21 SampleClass6 [FATAL] system problem at id 1620503499
-
-In the square brackets are the log levels. For example *[DEBUG]*, *[FATAL]*.
-
-The output data will be put into a file showing the various log4j log levels along with its frequency occurrence:
+
+This MapReduce job takes a log4j log file as input, and generates an output file that contains the log level along with its frequency count. The following is a sample output file:
[TRACE] 8
[DEBUG] 4
@@ -54,18 +19,115 @@ The output data will be put into a file showing the various log4j log levels alo
[ERROR] 1
[FATAL] 1
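Conceptually, the job consists of a mapper that extracts the bracketed log level from each line and a reducer that sums the per-level counts. The following is a minimal sketch of that idea; the class and member names are illustrative only and are not the tutorial's own source code, which you will create later in this article:

    // Illustrative sketch: count log4j log levels with Hadoop MapReduce (new API).
    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class LogLevelCount {
        // Emits (logLevel, 1) for every input line that contains a bracketed level such as [ERROR].
        public static class LevelMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final Pattern LEVEL =
                Pattern.compile("\\[(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\\]");
            private static final IntWritable ONE = new IntWritable(1);
            private final Text level = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                Matcher m = LEVEL.matcher(value.toString());
                if (m.find()) {
                    level.set(m.group(0));   // e.g. "[ERROR]"
                    context.write(level, ONE);
                }
            }
        }

        // Sums the counts emitted by the mapper for each log level.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }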
-## Procedures
+**Estimated time to complete:** 30 minutes
+
+## In this Article
+* [Big Data and Hadoop MapReduce](#mapreduce)
+* [Upload a sample log4j file to the blob storage](#uploaddata)
+* [Connect to an HDInsight Cluster](#connect)
+* [Create a MapReduce job](#createjob)
+* [Run the MapReduce job](#runjob)
+* [Tutorial Clean Up](#cleanup)
+* [Next Steps](#nextsteps)
+
+##<a id="mapreduce"></a> Big Data and Hadoop MapReduce
+Generally, all applications save errors, exceptions and other coded issues in a log file. These log files can get quite large in size, containing a wealth of data that must be processed and mined. Log files are a good example of big data. Working with big data is difficult using relational databases with statistics and visualization packages. Due to the large amounts of data and the computation of this data, parallel software running on tens, hundreds, or even thousands of servers is often required to compute this data in a reasonable time. Hadoop provides a MapReduce framework for writing applications that process large amounts of structured and semi-structured data in parallel across large clusters of machines in a very reliable and fault-tolerant manner.
+
+The following figure is the visual representation of what you will accomplish in this tutorial:
+
+![HDI.VisualObjective](../media/HDI.VisualObject.gif "Visual Objective")
+
You will complete the following tasks in this tutorial:
-1. Connect to an HDInsight Cluster
-2. Import data into HDFS
+1. Upload a sample log4j file to the blob storage
+2. Connect to an HDInsight Cluster
3. Create a MapReduce job
4. Run the MapReduce job
5. Tutorial Clean Up
+##<a id="uploaddata"></a>Upload a Sample Log4j File to the Blob Storage
+
+HDInsight provides two options for storing data: Windows Azure Blob Storage and the Hadoop Distributed File System (HDFS). For more information on choosing file storage, see [Using Windows Azure Blob Storage with HDInsight](/en-us/manage/services/hdinsight/using-blob-store). When you provision an HDInsight cluster, the provisioning process creates a Windows Azure Blob storage container as the default HDInsight file system. To simplify the tutorial procedures, you will use this container for storing the log4j file.
+
+*Azure Storage Explorer* is a useful tool for inspecting and altering the data in your Windows Azure Storage. It is a free tool that can be downloaded from [http://azurestorageexplorer.codeplex.com/](http://azurestorageexplorer.codeplex.com/ "Azure Storage Explorer").
+
+Before using the tool, you must know your Windows Azure storage account name and account key. For instructions on creating a Windows Azure Storage account, see [How To Create a Storage Account](/en-us/manage/services/storage/how-to-create-a-storage-account/). For instructions on getting the account name and key, see the *How to: View, copy and regenerate storage access keys* section of [How to Manage Storage Accounts](/en-us/manage/services/storage/how-to-manage-a-storage-account/).
+
+1. Download [sample.log](http://go.microsoft.com/fwlink/?LinkID=37003 "Sample.log"), a sample log4j log file, to your local computer.
+
+2. Run **Azure Storage Explorer**.
+
+ ![HDI.AzureStorageExplorer](../media/HDI.AzureStorageExplorer.png "Azure Storage Explorer")
+
+3. Click **Add Account** if the Windows Azure storage account has not been added to the tool.
+
+ ![HDI.ASEAddAccount](../media/HDI.ASEAddAccount.png "Add Account")
+
+4. Enter **Storage account name** and **Storage account key**, and then click **Add Storage Account**.
+5. From **Storage Type**, click **Blobs** to display the Windows Azure Blob storage of the account.
+
+ ![HDI.ASEBlob](../media/HDI.ASEUploadFile.png "Azure Storage Explorer")
+
+6. From **Container**, click the container that is designated as the default file system. The default name is the HDInsight cluster name. You will see the folder structure of the container.
-### Connect to an HDInsight Cluster
-You must have an HDInsight cluster previsioned before you can work on this tutorial. For information on prevision an HDInsight cluster see [How to Administer HDInsight Service](/en-us/manage/services/hdinsight/howto-administer-hdinsight/) or [Getting Started with Windows Azure HDInsight Service](/en-us/manage/services/hdinsight/get-started-hdinsight/).
+ <div class="dev-callout">
+ <b>Note</b>
+    <p>To simplify the tutorial, you will use the default file system. You can also use other containers on the same storage account or on other storage accounts. For more information, see [Using Windows Azure Blob Storage with HDInsight](/en-us/manage/services/hdinsight/using-blob-store/).</p>
+ </div>
+
+7. From **Blob**, click **Upload**.
+8. Browse to the sample.log file you just downloaded, and then click **Open**. You will see the sample.log file listed there.
+9. Double-click the sample.log file to open it.
+11. Click **Text** to switch to the Text tab so you can view the content of the file. Notice that the following screen output shows a snippet of sample.log, where the data follows a particular structure (except for the row that starts with “java.lang.Exception…”).
+
+ ![Sample.log](../media/HDI.SampleLog.png)
+
+ Starting from left to right, the structured data rows have a *date* in column 1, *timestamp* in column 2, *class name* in column 3, *log level* in column 4, and so on.
+
+ The row starting with “java.lang.Exception” does not follow this “well-formed” data structure and is therefore, considered unstructured. The following table shows the key differences between the structured rows and unstructured rows.
+
+
+ <table border="1">
+ <tr>
+ <td>
+ Data Type
+ </td>
+ <td>
+ Date Column
+ </td>
+ <td>
+ Severity Column
+ </td>
+ </tr>
+ <tr>
+ <td>
+ Structured
+ </td>
+ <td>
+ 1
+ </td>
+ <td>
+ 4
+ </td>
+ </tr>
+ <tr>
+ <td>
+ Unstructured
+ </td>
+ <td>
+ 2
+ </td>
+ <td>
+ 5
+ </td>
+ </tr>
+ </table>
+12. Click **Close**.
+13. From the **File** menu, click **Exit** to close Azure Storage Explorer.
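Azure Storage Explorer is only one way to place the file in the default container. If you prefer the command line, the same upload can also be done from the cluster's Hadoop command prompt once you have connected to the cluster in the next section; the following is a sketch, and the local path C:\Tutorials\sample.log is an assumption rather than part of the original steps:

    hadoop fs -put C:\Tutorials\sample.log asv:///sample.log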
+
+
+##<a id="connect"></a>Connect to an HDInsight Cluster
+You must have an HDInsight cluster provisioned before you can work on this tutorial. To enable the Windows Azure HDInsight Service preview, click [here](https://account.windowsazure.com/PreviewFeatures). For information on provisioning an HDInsight cluster, see [How to Administer HDInsight Service](/en-us/manage/services/hdinsight/howto-administer-hdinsight/) or [Getting Started with Windows Azure HDInsight Service](/en-us/manage/services/hdinsight/get-started-hdinsight/).
1. Sign in to the [Management Portal](https://manage.windowsazure.com).
2. Click **HDINSIGHT** on the left. You shall see a list of deployed Hadoop clusters.
@@ -76,65 +138,50 @@ You must have an HDInsight cluster previsioned before you can work on this tutor
7. Click **Yes**.
8. Enter your credentials, and then press **ENTER**.
9. From Desktop, click **Hadoop Command Line**. You will use Hadoop command prompt to execute all of the commands in this tutorial. Most of the commands can be run from the [Interactive JavaScript Console](/en-us/manage/services/hdinsight/interactive-javascript-and-hive-consoles/).
+10. Run the following command to list and verify the sample.log file you uploaded to Azure Storage Vault (ASV):
-### Import data into HDFS
+ hadoop fs -ls asv:///sample.log
-MapReduce job can read data from either *Hadoop Distributed File system (HDFS)* or *Windows Azure Blob Storage*. For more information, see [How to: Upload data to HDInsight](/en-us/manage/services/hdinsight/howto-upload-data-to-hdinsight/). In this task, you will place the log file data into HDFS where MapReduce will read it and run the job.
+ The asv syntax is for listing the files in the default file system. To access files in other containers, use the following syntax:
-1. From Hadoop command prompt, run the following commands to create a directory on the C drive:
+ hadoop fs -ls asv[s]://[[<container>@]<storagename>.blob.core.windows.net]/<path>
- c:
- cd \
- mkdir Tutorials
+ For example, you can list the same file using the following command:
-2. Run the following commands to create log file in the C:\Tutorials folder:
+ hadoop fs -ls asv://container@storagename.blob.core.windows.net/sample.log
- cd \Tutorials
- notepad sample.log
-
- You can also download the [sample.log](http://go.microsoft.com/fwlink/?LinkID=286223 "Sample.log") file and put it into the C:\Tutorials folder.
+    Replace *container* with the container name, and *storagename* with the Blob storage account name.
-3. Click **Yes** to create a new file.
-4. Copy and Paste the input data shown earlier in the article into Notepad.
-5. Press **CTRL+S** to save the file, and then close Notepad.
+ Because the file is located on the default file system, the same result can also be retrieved by using the following command:
-5. From Hadoop command prompt, create an input directory in HDFS:
-
- hadoop fs -mkdir Tutorials/input/
-
-6. Verify that the input directory has been created in the Hadoop file system:
-
- hadoop fs -ls Tutorials/
-
- ![](../media/HDI-MR3.png)
-
-3. Load the sample.log input file into HDFS, and create the input directory:
-
- hadoop fs -put sample.log Tutorials/input/
-
-4. Verify that the sample.log has been loaded into HDFS:
+ hadoop fs -ls /sample.log
+
+    To use asvs, you must provide the fully qualified domain name (FQDN). For example, to access sample.log on the default file system:
- hadoop fs -ls Tutorials/input/
+        hadoop fs -ls asvs://container@storagename.blob.core.windows.net/sample.log
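    Most of these listing commands can also be run from the Interactive JavaScript Console mentioned earlier. There, the equivalent of the default file system listing looks roughly like the following sketch, assuming the console's #ls command:

        #ls asv:///sample.log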
- ![](../media/HDI-MR4.png)
+
-### Create the MapReduce job ##
-The Java programming language is used in this sample. Hadoop Streaming allows developers to use virtually any programming language to create MapReduces jobs. For a Hadoop Streaming sample using C#, see [Hadoop on Windows Azure - Working With Data](/en-us/develop/net/tutorials/hadoop-and-data/).
+##<a id="createjob"></a> Create the MapReduce job ##
+The Java programming language is used in this sample. Hadoop Streaming allows developers to use virtually any programming language to create MapReduce jobs.
-1. From Hadoop command prompt, run the following commands to change directory to the C:\Tutorials folder:
+1. From the Hadoop command prompt, run the following commands to create a C:\Tutorials folder and change to it:
- c: [ENTER]
- cd \Tutorials [ENTER]
+ mkdir c:\Tutorials
+ cd \Tutorials
2. Run the following command to create a Java file in the C:\Tutorials folder:
- notepad log4jMapReduce.java [ENTER]
+ notepad log4jMapReduce.java
- Note: the class name is hard-coded in the program. If you want to change the file name, you must update the java program accordingly.
+ <div class="dev-callout">
+ <b>Note</b>
+ <p>The class name is hard-coded in the program. If you want to change the file name, you must update the java program accordingly.</p>
+ </div>
-4. Click **Yes** to create a new file.
+3. Click **Yes** to create a new file.
-5. Copy and paste the following java program into the Notepad window.
+4. Copy and paste the following Java program into the Notepad window. (A consolidated sketch of the complete class also appears after this procedure.)
import java.io.IOException;
import java.util.Iterator;
@@ -211,7 +258,7 @@ The Java programming language is used in this sample. Hadoop Streaming allows de
public static void main(String[] args) throws Exception
{
//Code to create a new Job specifying the MapReduce class
- final JobConf conf = new JobConf(log4JMapReduce.class);
+ final JobConf conf = new JobConf(log4jMapReduce.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
@@ -234,7 +281,8 @@ The Java programming language is used in this sample. Hadoop Streaming allows de
}
}
-6. Press **CTRL+S** to save the file.
+5. Press **CTRL+S** to save the file.
+6. Close Notepad.
7. Compile the java file using the following command:
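    The exact compile command falls in a part of the file not shown in this diff hunk. As a rough sketch only — assuming the JDK shipped with the cluster under C:\apps\dist\java, and with a placeholder for the Hadoop core jar, whose exact path varies by cluster version — it would look something like:

        C:\apps\dist\java\bin\javac -classpath <path-to-hadoop-core-jar> log4jMapReduce.java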
@@ -246,17 +294,16 @@ The Java programming language is used in this sample. Hadoop Streaming allows de
C:\apps\dist\java\bin\jar -cvf log4jMapReduce.jar *.class
-Notice the results before and after executing the jar command, including verifying the existence of the sample.log file in the non-HDFS directory structure (used later).
-
-![](../media/HDI-MR1.png)
-
-![](../media/HDI-MR2.png)
+ After executing the jar command, you will have the following files in the C:\Tutorials directory:
+ log4jMapReduce$Map.class
+ log4jMapReduce$Reduce.class
+ log4jMapReduce.class
+ log4jMapReduce.jar
+ log4jMapReduce.java
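Only fragments of the Java program (the imports and part of the `main` method) are visible in this diff. For orientation, the following is a consolidated sketch of what the complete `log4jMapReduce` class plausibly looks like, using the old `org.apache.hadoop.mapred` API implied by the visible `JobConf` code; the regular expression and the bodies of the `Map` and `Reduce` classes are illustrative reconstructions, not the tutorial's exact code.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class log4jMapReduce
    {
        // Mapper: emit (logLevel, 1) for every line that contains one of the six log levels.
        public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
        {
            private static final Pattern LEVEL = Pattern.compile("\\[(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\\]");
            private static final IntWritable ONE = new IntWritable(1);
            private final Text logLevel = new Text();

            public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException
            {
                Matcher m = LEVEL.matcher(value.toString());
                if (m.find())
                {
                    logLevel.set(m.group(1));
                    output.collect(logLevel, ONE);
                }
            }
        }

        // Reducer: sum the 1s for each log level to produce its frequency count.
        public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
        {
            public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException
            {
                int sum = 0;
                while (values.hasNext())
                {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        // Driver: configure and submit the job; args[0] is the input path, args[1] the output directory.
        public static void main(String[] args) throws Exception
        {
            //Code to create a new Job specifying the MapReduce class
            final JobConf conf = new JobConf(log4jMapReduce.class);
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setReducerClass(Reduce.class);
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }

With this structure, each mapper emits a (log level, 1) pair per matching line, and the reducer sums those pairs to produce the per-level counts shown in the sample output.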
-
-
-### Run the MapReduce job
-Until now, you have uploaded a log4j log files to HDFS, and compiled the MapReduce job. The next step is to run the job.
+##<a id="runjob"></a> Run the MapReduce job
+Until now, you have uploaded a log4j log file to Blob storage and compiled the MapReduce job. The next step is to run the job.
1. From Hadoop command prompt, execute the following command to run the Hadoop MapReduce job:
@@ -268,35 +315,36 @@ Until now, you have uploaded a log4j log files to HDFS, and compiled the MapRedu
- Specifying the jar file (log4jMapReduce.jar)
- Indicating the class file (log4jMapReduce)
 - Specifying the input file (asv:///sample.log) and the output directory (Tutorials/output)
- - Running the MapReduce job
+ - Running the MapReduce job
+
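    The command itself is elided from this diff hunk. Based on the bullets above and on the file uploaded earlier, a sketch of its likely shape is shown below; the exact paths are inferred, not taken from the original text:

        hadoop jar log4jMapReduce.jar log4jMapReduce asv:///sample.log Tutorials/output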
+ ![HDI.MapReduceResults](../media/HDI.MapReduceResults.png "MapReduce job output")
- The Reduce programs begin to process the data when the Map programs are 100% complete. Prior to that, the Reducer(s) queries the Mappers for intermediate data and gathers the data, but waits to process. This is shown in the following screenshot.
-
- ![](../media/HDI-MR5.png)
+    The Reduce programs begin to process the data only when the Map programs are 100% complete. Before that, the reducers query the mappers for intermediate data and gather it, but wait to process it.
- The next screen output shows Reduce input records (that correspond to the six log levels) and Map output records (that contain key value pairs). As you can see, the Reduce program condensed the set of intermediate values that share the same key (DEBUG, ERROR, FATAL, and so on) to a smaller set of values.
-
- ![](../media/HDI-MR6.png)
+ There are 6 Reduce input records (that correspond to the six log levels), and 135 Map output records (that contain key value pairs). The Reduce program condensed the set of intermediate values that share the same key (DEBUG, ERROR, FATAL, and so on) to a smaller set of values.
2. View the output of the MapReduce job in HDFS:
hadoop fs -cat Tutorials/output/part-00000
-
- **Note:** By default, Hadoop creates files begin with the following naming convention: “part-00000”. Additional files created by the same job will have the number increased.
- After executing the command, you should see the following output:
-
- ![](../media/HDI-MR7.png)
-
+    By default, Hadoop creates files that begin with the following naming convention: “part-00000”. Additional files created by the same job have the number incremented. The output looks like the following:
+
+ DEBUG 434
+ ERROR 6
+ FATAL 2
+ INFO 96
+ TRACE 816
+ WARN 11
+
    Notice that, after running MapReduce, the log levels are now totaled and in a structured format.
-### Tutorial Clean Up ##
+##<a id="cleanup"></a> Tutorial Clean Up ##
The cleanup task applies to this tutorial only; it is not performed in an actual deployment. In this task, you will delete the input file and the output directory so that, if you like, you can run the tutorial again.
-1. Delete the input directory and recursively delete files within the directory:
+1. Delete the sample.log file:
- hadoop fs -rmr Tutorials/input/
+ hadoop fs -rm asv:///sample.log
2. Delete the output directory and recursively delete files within the directory:
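    The command for this step is elided from the diff hunk; given the output path used by the job, it is presumably the following (a sketch):

        hadoop fs -rmr Tutorials/output/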
@@ -304,13 +352,11 @@ The clean up task applies to this tutorial only; it is not performed in the actu
Congratulations! You have successfully completed this tutorial.
-##Next Steps
+##<a id="nextsteps"></a>Next Steps
While MapReduce provides powerful diagnostic abilities, it can be a bit challenging to master. Other languages such as Pig and Hive provide an easier way to work with data stored in your HDInsight Service. To learn more, see the following articles:
-* [Using Pig with HDInsight][hdinsight-pig]
-
-* [Using Hive with HDInsight][hdinsight-hive]
-
-[hdinsight-pig]: /en-us/manage/services/hdinsight/using-pig-with-hdinsight/
-[hdinsight-hive]: /en-us/manage/services/hdinsight/using-hive-with-hdinsight/
+* [Getting Started with Windows Azure HDInsight Service](/en-us/manage/services/hdinsight/get-started-hdinsight/)
+* [Using Hive with HDInsight](/en-us/manage/services/hdinsight/using-hive-with-hdinsight/)
+* [Using Pig with HDInsight](/en-us/manage/services/hdinsight/using-pig-with-hdinsight/)
+* [How to Run the HDInsight Samples](/en-us/manage/services/hdinsight/howto-run-samples/)
View
98 ITPro/Services/hdinsight/using-pig.md
@@ -11,20 +11,25 @@
- **Transform**: Manipulate the data
- **Dump or store**: Output data to the screen or store for processing
-In this tutorial, you will write Pig Latin statements to analyze an [Apache log4j](http://en.wikipedia.org/wiki/Log4j) log file, and run various queries on the data to generate output. This tutorial demonstrates the advantages of Pig, and how it can be used to simplify MapReduce jobs.
+In this tutorial, you will write Pig Latin statements to analyze an Apache log4j log file, and run various queries on the data to generate output. This tutorial demonstrates the advantages of Pig, and how it can be used to simplify MapReduce jobs.
-**Estimated time to complete:** 30 minutes
-
-#In this Article
+[Apache Log4j](http://en.wikipedia.org/wiki/Log4j) is a logging utility. Each line in a log file contains a *log level* field that indicates the type and severity of the entry. For example:
-- The Pig usage case
-- Procedures
-- Next Steps
-
+    2012-02-03 20:26:41 SampleClass3 [TRACE] verbose detail for id 1527353937
+
+**Estimated time to complete:** 30 minutes
+
+##In this Article
-## The Pig Usage Case
+* [The Pig usage case](#usage)
+* [Upload a sample log4j file to Windows Azure Blob Storage](#uploaddata)
+* [Connect to your HDInsight cluster](#connect)
+* [Use Pig in the interactive mode](#interactivemode)
+* [Use Pig in the batch mode](#batchmode)
+* [Tutorial clean up](#cleanup)
+* [Next Steps](#nextsteps)
+
+##<a id="usage"></a>The Pig Usage Case
Databases work well for small data sets and low-latency queries. However, when it comes to big data and data sets measured in terabytes, traditional SQL databases are not the ideal solution. Historically, as database load increased and performance degraded, database administrators had to buy bigger hardware.
Generally, applications record errors, exceptions, and other coded issues in log files so that administrators can review the problems or generate metrics from the log file data. These log files usually grow quite large and contain a wealth of data that must be processed and mined.
@@ -44,9 +49,7 @@ Figure 2: Data Transformation:
![Data Transformation](../media/HDI.DataTransformation.png)
-
-## Procedure
-You will perform the following tasks:
+You will complete the following tasks in this tutorial:
* Upload a sample log4j file to Windows Azure Blob Storage
* Connect to your HDInsight cluster
@@ -54,7 +57,10 @@ You will perform the following tasks:
* Use Pig in the batch mode
* Tutorial clean up
-###Upload a Sample Log4j File to Windows Azure Blob Storage
+##<a id="uploaddata"></a>Upload a Sample Log4j File to Windows Azure Blob Storage
+
+HDInsight provides two options for storing data: Windows Azure Blob Storage and the Hadoop Distributed File System (HDFS). For more information on choosing file storage, see [Using Windows Azure Blob Storage with HDInsight](/en-us/manage/services/hdinsight/using-blob-store). When you provision an HDInsight cluster, the provisioning process creates a Windows Azure Blob storage container as the default HDInsight file system. To simplify the tutorial procedures, you will use this container for storing the log4j file.
+
*Azure Storage Explorer* is a useful tool for inspecting and altering the data in your Windows Azure Storage. It is a free tool that can be downloaded from [http://azurestorageexplorer.codeplex.com/](http://azurestorageexplorer.codeplex.com/ "Azure Storage Explorer").
Before using the tool, you must know your Windows Azure storage account name and account key. For instructions on getting this information, see the *How to: View, copy and regenerate storage access keys* section of [How to Manage Storage Accounts](/en-us/manage/services/storage/how-to-manage-a-storage-account/).
@@ -82,8 +88,9 @@ Before using the tool, you must know your Windows Azure storage account name and
12. Click **Close**.
13. From the **File** menu, click **Exit** to close Azure Storage Explorer.
-### Connect to your HDInsight Cluster ##
+##<a id="connect"></a>Connect to your HDInsight Cluster ##
+You must have an HDInsight cluster provisioned before you can work on this tutorial. To enable the Windows Azure HDInsight Service preview, click [here](https://account.windowsazure.com/PreviewFeatures). For information on provisioning an HDInsight cluster, see [How to Administer HDInsight Service](/en-us/manage/services/hdinsight/howto-administer-hdinsight/) or [Getting Started with Windows Azure HDInsight Service](/en-us/manage/services/hdinsight/get-started-hdinsight/).
1. Sign in to the [Management Portal](https://manage.windowsazure.com).
2. Click **HDINSIGHT**. You shall see a list of deployed Hadoop clusters.
@@ -94,7 +101,7 @@ Before using the tool, you must know your Windows Azure storage account name and
10. Click **Yes**.
11. From Desktop, double-click **Hadoop Command Line**.
-### Use Pig in the Interactive Mode ##
+##<a id="interactivemode"></a> Use Pig in the Interactive Mode ##
First, you will use Pig Latin in interactive mode (Grunt shell) to analyze a single log file, and then you will use Pig in batch mode (script) to perform the same task.
@@ -107,6 +114,14 @@ First, you will use Pig Latin in interactive mode (Grunt shell) to analyze a sin
grunt> LOGS = LOAD 'asv:///sample.log';
+ <div class="dev-callout"> 
+ <b>Note</b> 
+    <p>To use asvs, you must provide the fully qualified domain name (FQDN). For example: <br/>
+LOGS = LOAD 'asvs://container@storagename.blob.core.windows.net/sample.log';</p>
+ </div>
+
+
+
3. Show the content:
grunt> dump LOGS;
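    The remaining statements of the interactive session are elided from this diff hunk; they mirror the batch script shown later in the article, entered one at a time at the grunt prompt. For example (a sketch based on that script):

        grunt> LEVELS = foreach LOGS generate REGEX_EXTRACT($0, '(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)', 1) as LOGLEVEL;
        grunt> FILTEREDLEVELS = FILTER LEVELS by LOGLEVEL is not null;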
@@ -233,7 +248,7 @@ First, you will use Pig Latin in interactive mode (Grunt shell) to analyze a sin
grunt> quit;
-### Use Pig in the Batch Mode ##
+##<a id="batchmode"></a>Use Pig in the Batch Mode ##
Next, you will use Pig in batch mode by creating a Pig script made up of the same Pig commands you used in the last task.
@@ -243,20 +258,20 @@ Next, you will use Pig in batch mode by creating a Pig script made up of the sam
2. Copy and paste the following Pig commands into the pigscript.pig file:
- - load the log file
- LOGS = LOAD 'asv:///sample.log';
- - iterate through each line and match on the 6 log levels
- LEVELS = foreach LOGS generate REGEX_EXTRACT($0, '(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)', 1) as LOGLEVEL;
- - filter out non-match rows, i.e. empty rows
- FILTEREDLEVELS = FILTER LEVELS by LOGLEVEL is not null;
- - now group/consolidate all of the log levels into their own row, counting is not done yet
- GROUPEDLEVELS = GROUP FILTEREDLEVELS by LOGLEVEL;
- - for each group, now count the occurrences of log levels which will be the frequencies of each log level
- FREQUENCIES = foreach GROUPEDLEVELS generate group as LOGLEVEL, COUNT(FILTEREDLEVELS.LOGLEVEL) as COUNT;
- - sort the frequencies in descending order
- RESULT = order FREQUENCIES by COUNT desc;
- - write the result to a file
- store RESULT into 'sampleout';
+        -- load the log file
+        LOGS = LOAD 'asv:///sample.log';
+        -- iterate through each line and match on the 6 log levels
+        LEVELS = foreach LOGS generate REGEX_EXTRACT($0, '(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)', 1) as LOGLEVEL;
+        -- filter out non-match rows, i.e. empty rows
+        FILTEREDLEVELS = FILTER LEVELS by LOGLEVEL is not null;
+        -- now group/consolidate all of the log levels into their own row; counting is not done yet
+        GROUPEDLEVELS = GROUP FILTEREDLEVELS by LOGLEVEL;
+        -- for each group, count the occurrences of log levels, which will be the frequencies of each log level
+        FREQUENCIES = foreach GROUPEDLEVELS generate group as LOGLEVEL, COUNT(FILTEREDLEVELS.LOGLEVEL) as COUNT;
+        -- sort the frequencies in descending order
+        RESULT = order FREQUENCIES by COUNT desc;
+        -- write the result to a file
+        store RESULT into 'sampleout';
3. Save the file and close Notepad.
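The step that actually runs the script is elided from this diff hunk. On the cluster, a Pig script is typically run from the Hadoop command line with a command along these lines (a sketch, not necessarily the exact command in the tutorial):

    pig pigscript.pig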
@@ -272,7 +287,7 @@ Next, you will use Pig in batch mode by creating a Pig script made up of the sam
The results are the same as what you got in the interactive mode.
-### Tutorial Clean Up ##
+##<a id="cleanup"></a>Tutorial Clean Up ##
In this task, you will delete input and output directories so that if you like, you can run the tutorial again.
@@ -282,22 +297,13 @@ In this task, you will delete input and output directories so that if you like,
Congratulations! You have successfully completed this tutorial.
-##Next Steps
+##<a id="nextsteps"></a>Next Steps
While Pig allows you to perform data analysis, other languages included with the HDInsight Service may be of interest to you also. Hive provides a SQL-like query language that allows you to easily query against data stored in HDInsight, while MapReduce jobs written in Java allow you to perform complex data analysis. For more information, see the following:
-* [Using Hive with HDInsight][hdinsight-hive]
-* [Using MapReduce with HDInsight][hdinsight-mapreduce]
-
-
-[hdinsight-mapreduce]: /en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
-[hdinsight-hive]: /en-us/manage/services/hdinsight/using-hive-with-hdinsight/
-
-
-
-
-
-
-
-
+* [Getting Started with Windows Azure HDInsight Service](/en-us/manage/services/hdinsight/get-started-hdinsight/)
+* [Using MapReduce with HDInsight](/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/)
+* [Using Hive with HDInsight](/en-us/manage/services/hdinsight/using-hive-with-hdinsight/)
+* [Using Pig with HDInsight](/en-us/manage/services/hdinsight/using-pig-with-hdinsight/) 
+* [How to Run the HDInsight Samples](/en-us/manage/services/hdinsight/howto-run-samples/)
View
BIN  ITPro/Services/media/ASV-files.png
View
BIN  ITPro/Services/media/HDI.ASEBlob.png
View
BIN  ITPro/Services/media/HDI.ASEUploadFile.png
View
BIN  ITPro/Services/media/HDI.ASVSample.PNG
View
BIN  ITPro/Services/media/HDI.ClusterSummary.png
View
BIN  ITPro/Services/media/HDI.CustomCreateStorageAccount.png
View
BIN  ITPro/Services/media/HDI.Dashboard1.png
View
BIN  ITPro/Services/media/HDI.HadoopCommandLine.png
View
BIN  ITPro/Services/media/HDI.HiveConsole.png
View
BIN  ITPro/Services/media/HDI.IJCListFile.png
View
BIN  ITPro/Services/media/HDI.InteractiveJavaScriptConsole.png
View
BIN  ITPro/Services/media/HDI.JobHistoryPage.png
View
BIN  ITPro/Services/media/HDI.MapReduceResults.png
View
BIN  ITPro/Services/media/HDI.MonitorPage.png
View
BIN  ITPro/Services/media/HDI.QuickCreate.png
View
BIN  ITPro/Services/media/HDI.fsput.png