Update machine-learning-walkthrough-2-upload-data.md #1938

Merged
merged 1 commit

3 participants

Sean Maguire, Azure Contribution License Agreements, C.J. Gronlund
Sean Maguire

Did editorial cleanup, updated name of CloudML Studio to ML Studio.

Sean Maguire smaguiremsft Update machine-learning-walkthrough-2-upload-data.md
Did editorial cleanup, updated name of CloudML Studio to ML Studio.
2e2f05a
Azure Contribution License Agreements
Collaborator

Hi smaguiremsft,

Thanks for your contribution!

In order for us to be able to evaluate and accept your contribution, we ask that you sign a contribution license agreement.
Please sign an electronic agreement at http://azurecla.azurewebsites.net/ .

Thanks,
The Azure Team

C.J. Gronlund
Owner

@smaguiremsft
Thanks for this contribution to Azure documentation. I've submitted this PR to the article writer for review.
--Carolyn

C.J. Gronlund cjgronlund merged commit 2e2f05a
C.J. Gronlund
Owner

@smaguiremsft
Thanks for this excellent contribution. I did a command-line merge to accept this PR here: #2005
Thank you,
Carolyn
Azure Documentation

Commits on Jul 23, 2014
  1. Sean Maguire

    Update machine-learning-walkthrough-2-upload-data.md

    smaguiremsft authored
    Did editorial cleanup, updated name of CloudML Studio to ML Studio.
Showing with 15 additions and 15 deletions.
  1. +15 −15 articles/machine-learning-walkthrough-2-upload-data.md
articles/machine-learning-walkthrough-2-upload-data.md
@@ -28,18 +28,18 @@ To develop a predictive model for credit risk, we’ll use the “UCI Statlog (G
We’ll use the file named **german.data**. Download this file to your local hard drive.
-This dataset contains rows of 20 variables for 1000 past applicants for credit. These 20 variables represent the dataset’s feature vector which provides identifying characteristics for each credit applicant. An additional column in each row represents the applicant’s credit risk, with 700 applicants identified as a low credit risk and 300 as a high risk.
+This dataset contains rows of 20 variables for 1000 past applicants for credit. These 20 variables represent the dataset’s feature vector, which provides identifying characteristics for each credit applicant. An additional column in each row represents the applicant’s credit risk, with 700 applicants identified as a low credit risk and 300 as a high risk.
-The UCI website provides a description of the attributes of the feature vector which include financial information, credit history, employment status, and personal information. For each applicant a binary rating has been given indicating whether they are a low or high credit risk.
+The UCI website provides a description of the attributes of the feature vector, which include financial information, credit history, employment status, and personal information. For each applicant, a binary rating has been given indicating whether they are a low or high credit risk.
We’ll use this data to train a predictive analytics model. When we’re done, our model should be able to accept information for new individuals and predict whether they are a low or high credit risk.
Here’s one interesting twist. The description of the dataset explains that misclassifying a person as a low credit risk when they are actually a high credit risk is 5 times more costly to the financial institution than misclassifying a low credit risk as high. One simple way to take this into account in our experiment is by duplicating (5 times) those entries that represent someone with a high credit risk. Then, if the model misclassifies a high credit risk as low, it will do that misclassification 5 times, once for each duplicate. This will increase the cost of this error in the training results.
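
As an aside (not part of this diff or the walkthrough itself), one way to apply that 5x duplication outside ML Studio is a small PowerShell pass over the converted CSV. This is only a sketch; it assumes the last column of german.csv holds the UCI risk label, with 1 marking a low-risk and 2 a high-risk applicant, and the output file name german-weighted.csv is arbitrary.

    # Sketch only: duplicate high-risk rows five times before uploading.
    # Assumes german.csv exists and its last column is the UCI risk label
    # (1 = low risk, 2 = high risk).
    $rows = Get-Content german.csv
    $weighted = foreach ($row in $rows) {
        if (($row -split ",")[-1] -eq "2") {
            1..5 | ForEach-Object { $row }   # emit five copies of each high-risk row
        } else {
            $row                             # keep low-risk rows as-is
        }
    }
    $weighted | Set-Content german-weighted.csv

The walkthrough performs the same weighting inside the experiment itself; the sketch above is just an equivalent pre-processing alternative.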
##Convert the dataset format
-The original dataset uses a blank-separated format. CloudML Studio works better with a comma-separated (CSV) file, so we’ll convert the dataset by replacing spaces with commas.
+The original dataset uses a blank-separated format. ML Studio works better with a comma-separated (CSV) file, so we’ll convert the dataset by replacing spaces with commas.
-We can do this using the following PowerShell command:
+We can do this using the following Windows PowerShell command:
cat german.data | %{$_ -replace " ",","} | sc german.csv
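
For readers less familiar with PowerShell aliases, the same command spelled out with full cmdlet names (an equivalent form, not part of the diff) is:

    # cat, %, and sc are aliases for Get-Content, ForEach-Object, and Set-Content.
    Get-Content german.data |
        ForEach-Object { $_ -replace " ", "," } |
        Set-Content german.csv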
@@ -48,16 +48,16 @@ We can also do this using the Unix sed command:
sed 's/ /,/g' german.data > german.csv
##Upload the dataset to ML Studio
-Once the data has been converted to CSV format, we need to upload it into CloudML Studio.
-
-1. In CloudML Studio, click **+NEW** at the bottom of the window
-2. Select **DATASET**
-3. Select **FROM LOCAL FILE**
-4. In the **Upload a new dataset dialog**, click **Browse** and find the **german.csv** file you created
-5. Enter a name for the dataset – for this example we’ll call it “UCI German Credit Card Data”
-6. For data type, select “Generic CSV File With no header (.nh.csv)”
-7. Add a description if you’d like
-8. Click **OK**
+Once the data has been converted to CSV format, we need to upload it into ML Studio.
+
+1. In ML Studio, click **+NEW** at the bottom of the window.
+2. Select **DATASET**.
+3. Select **FROM LOCAL FILE**.
+4. In the **Upload a new dataset dialog**, click **Browse** and find the **german.csv** file you created.
+5. Enter a name for the dataset – for this example we’ll call it “UCI German Credit Card Data”.
+6. For data type, select “Generic CSV File With no header (.nh.csv)”.
+7. Add a description if you’d like.
+8. Click **OK**.
![Upload the dataset][1]
@@ -65,4 +65,4 @@ This uploads the data into a Dataset module that we can use in an experiment.
-[1]: ./media/machine-learning-walkthrough-2-upload-data/upload1.png
+[1]: ./media/machine-learning-walkthrough-2-upload-data/upload1.png