# Azure ML OpenHack-Lite

## Challenge 1: Exploring and Preparing Data

Data exploration and preparation often account for the bulk of the effort in a machine learning project (a commonly quoted figure is around 80%), particularly when dealing with large volumes of tabular data.

Within the time constraints of this OpenHack, we'll use a more "directed" approach than a typical OpenHack challenge, so that we can explore the data preparation support in the Azure ML SDK (in particular, datasets) while preparing data for the subsequent challenges. We encourage you to treat this challenge as an introduction to the Azure ML SDK, and to explore the SDK in greater depth on your own later.

### Get the Data into Azure Storage
Throughout this OpenHack we'll be using data from the [Microsoft Malware Prediction Kaggle Competition](https://www.kaggle.com/c/microsoft-malware-prediction), so you'll need a Kaggle account. If you don't already have one, [sign up](https://www.kaggle.com/) for one now. Then:

1. Download the [competition data files](https://www.kaggle.com/c/microsoft-malware-prediction/data).
2. On your Azure subscription for this OpenHack, create a new Azure Storage account and a *private* blob container.
3. Upload the competition data files to your Azure blob container.



### Install the Azure ML SDKs

Now that you have the source data in place, use the `pip` utility to install the latest versions of the **azureml-sdk** package (including the *automl*, *explain*, and *notebooks* "extras"), and then import the **azureml.core** and **azureml.dataprep** namespaces.

> **Hint**: See the [Azure ML documentation](https://docs.microsoft.com/en-us/python/api/overview/azure/ml/install?view=azure-ml-py)

In [None]:
!pip install --upgrade "azureml-sdk[automl,explain,notebooks]"
!pip install --upgrade azureml-dataprep
import azureml.core
import azureml.dataprep as dprep

### Create an Azure ML Workspace

All Azure ML activities are performed within the context of a workspace, so the first thing you need to do is to create a new workspace for this OpenHack.

> **Hints**:
>
> - You can create a workspace using the [Azure portal](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-workspace) or by using the <a href = "https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace(class)?view=azure-ml-py" target="_blank">Azure ML SDK</a>. For this OpenHack, we recommend using the SDK.
> - After creating the workspace, use its `write_config()` method to save its configuration locally. That way you'll be able to reconnect to it later by using the `Workspace.from_config()` function.

In [None]:
from azureml.core import Workspace

name = 'YourWorkspaceNameGoesHere'
subscription_id = 'YourSubscriptionIdGoesHere'
resource_group = 'YourResourceGroupNameGoesHere'
location = 'YourDatacenterLocationGoesHere'

# Reuse the workspace if it already exists; otherwise create it
try:
    ws = Workspace.get(name=name, subscription_id=subscription_id, resource_group=resource_group)
except Exception:
    ws = Workspace.create(name=name,
                     subscription_id=subscription_id,
                     resource_group=resource_group,
                     location=location)
ws.get_details()
ws.write_config()

In [None]:
# To reconnect to the workspace later, load it from the saved config file:
# ws = Workspace.from_config()

### Register your Blob Container as a Datastore

*Datastores* provide a way to attach data storage to your Azure ML workspace. This makes it easier for processes in your workspace to access your data regardless of the compute instance within which they are running. Every Azure ML workspace has a *default* datastore based on Azure blob storage, to which you can add your own. In this case, you'll register the Azure blob container to which you uploaded the malware detection competition data as a datastore in your workspace.

> **Hints**:
>
> - Refer to the [Azure ML documentation on datastores](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data).
> - You can upload the training data to your blob store either via the datastore object in the SDK, or manually.

In [None]:
import azureml.core
from azureml.core import Workspace, Datastore

dataStoreName = 'DataStoreNameGoesHere'
containerName = 'ContainerNameGoesHere'
accountName = 'StorageAccountNameGoesHere'
accountKey = 'StorageAccountKeyGoesHere'

if ws.datastores.get(dataStoreName):
    ds = ws.datastores.get(dataStoreName)
else:
    ds = Datastore.register_azure_blob_container(workspace=ws, 
                                             datastore_name=dataStoreName,
                                             container_name=containerName,
                                             account_name=accountName, 
                                             account_key=accountKey)

### Use Datasets to Load the Training Data

Now you're ready to start using the AzureML SDK with Datasets to explore your data. Start by reading the **train.csv** into a Dataset, and then view the first 100 rows.

> **Hints**:
>
> - When loading the data, use the [auto_read_files](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py#auto-read-files-path--include-path-false-) function, which automatically detects the file format and data types.
> - Take a look at [this documentation](https://review.docs.microsoft.com/en-us/azure/machine-learning/service/how-to-create-register-datasets?branch=release-build-amls#create-datasets-from-azure-datastores) for an example of using the Dataset SDK to read data from a datastore.
> - You can use the **Dataset** object's [head](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.dataset%28class%29?view=azure-ml-py#head-count-) method to view the first 100 rows of data.

In [None]:
from azureml.core import Dataset

# Create an in-memory Dataset that points at the CSV file in the datastore
datapath = ds.path('NameOfYour.csvFileGoesHere')
dataset = Dataset.auto_read_files(datapath)

# Return the first 100 rows of the Dataset as a pandas dataframe
dataset.head(100)

There are 83 columns in the dataset. The **HasDetections** column contains the *label* for our machine learning problem (the value we want our model to predict), leaving 82 columns as potential *features* (attributes of the data observations that might help predict the label), which is quite a lot. One of the key goals in data exploration is to identify which of these features (either individually or in combination) are predictive of the label, and which are just noise that should be removed before training a model.

### Register the Dataset

Now that we have loaded our raw data into a Dataset, register it so that you can use the Datasets feature of AzureML to track the history of the data.
Explore [this doc](https://review.docs.microsoft.com/en-us/azure/machine-learning/service/how-to-create-register-datasets?branch=release-build-amls#register-your-datasets-with-workspace) to see an example of registering a dataset. This will be our "Raw Training Data".

In [None]:
datasetName = 'NameForYourDataset'
description = 'JustWriteADescriptionHere'
dataset = dataset.register(workspace = ws,
                           name = datasetName,
                           description = description,
                           exist_ok = True)

### Subset the Data

The full dataset is extremely large, and we'll run into some resource constraints if we try to explore it all in one go. So create a subset containing a random sample of 20% of the data.

Then, register the sample data as your "Training Data" dataset.

> **Hints**:
>
> - Use the [sample](https://review.docs.microsoft.com/en-us/azure/machine-learning/service/how-to-explore-prepare-data?branch=release-build-amls#sampling) method to randomly sample 20% of the data.
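
To make the idea of a seeded simple random sample concrete, here is a minimal plain-Python sketch. It assumes the sampler keeps each row independently with the given probability, which is how we read the `probability` setting; the `simple_random_sample` function and the row data are illustrative and are not part of the Azure ML SDK.

```python
import random

def simple_random_sample(rows, probability, seed):
    """Keep each row independently with the given probability."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return [row for row in rows if rng.random() < probability]

rows = list(range(10_000))  # stand-in for the rows of a dataset
sample_a = simple_random_sample(rows, probability=0.2, seed=42)
sample_b = simple_random_sample(rows, probability=0.2, seed=42)

print(len(sample_a) / len(rows))  # close to 0.2, but rarely exactly 0.2
print(sample_a == sample_b)       # the same seed gives the same sample
```

Note that under this assumption the sample size is only approximately 20% of the rows, and that recording the seed is what lets you reproduce the sample later.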


In [None]:
import random

# Create a seed for simple random sampling
seed = random.randint(0, 4294967295)

simple_random_sample_dataset = dataset.sample('simple_random', {'probability':0.2, 'seed': seed})
simple_random_sample_dataset.to_pandas_dataframe()

In [None]:
datasetRandomTwenty = simple_random_sample_dataset.register(workspace = ws,
                           name = 'PutANameForYourNewDatasetHere',
                           description = 'This is our data set randomly sampled at 20%',
                           exist_ok = False)

### Cache the Data

Datasets are defined via Dataflows. Dataflows use a "lazy read" technique, which means data is not actually read until it is needed. Some operations therefore seem to complete quickly, but in reality they're just "queued" until a downstream operation actually needs the data to perform a calculation or display data values. Even with a subset of the data, simple operations like displaying the row count can still take a long time if the data has to be read from its source each time an operation needs it. One way to overcome this is to cache the results of a dataflow operation locally. Let's take 100,000 rows of data and cache them.

> **Hints**:
>
> - Use the [get_definition](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py#get-definition-version-id-none-) method to work with the latest version of the dataset's dataflow.
> - Use the [take](https://docs.microsoft.com/en-us/python/api/azureml-dataprep/azureml.dataprep.dataflow?view=azure-dataprep-py#take-count--int-----azureml-dataprep-api-dataflow-dataflow) method to take the first 100,000 rows from the sample dataflow.
> - Use the [cache](https://docs.microsoft.com/en-us/python/api/azureml-dataprep/azureml.dataprep.dataflow?view=azure-dataprep-py#cache-directory-path--str-----azureml-dataprep-api-dataflow-dataflow) method to cache the subset of data from the dataflow locally.
> - Use the [row_count](https://docs.microsoft.com/en-us/python/api/azureml-dataprep/azureml.dataprep.dataflow?view=azure-dataprep-py#row-count) attribute to verify that the subset dataflow contains 100,000 rows.

*Note: Some of the methods referenced above are actually from the **DataPrep.Dataflow** class - the **Dataset** is in some ways a wrapper around DataPrep Dataflows as explained [here](https://review.docs.microsoft.com/en-us/azure/machine-learning/service/how-to-explore-prepare-data?branch=release-build-amls).*
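
The lazy-read and caching behavior described above can be illustrated without the SDK. The following plain-Python sketch uses a generator as a stand-in for a dataflow; `read_source` and the `stats` counter are made up for the illustration, and the real Dataflow implementation is certainly more involved.

```python
from itertools import islice

stats = {"reads": 0}  # counts how many rows are actually pulled from the "source"

def read_source(n_rows):
    """Pretend to read rows from slow remote storage, one at a time."""
    for i in range(n_rows):
        stats["reads"] += 1
        yield {"row_id": i}

# Building the pipeline reads nothing: generators are lazy.
pipeline = islice(read_source(1_000_000), 100)  # loosely analogous to take(100)
reads_before = stats["reads"]

# Materializing the result forces the read; keeping the list acts as a cache.
cached_rows = list(pipeline)
reads_after_first_use = stats["reads"]

# Later operations reuse the cached rows without touching the source again.
row_count = len(cached_rows)
reads_after_second_use = stats["reads"]

print(reads_before, reads_after_first_use, reads_after_second_use, row_count)
```

The point of the sketch is the counter: defining the pipeline costs nothing, the first use pays for the read, and anything computed from the cached rows does not pay again.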

In [None]:
datasetDefinition = datasetRandomTwenty.get_definition()

In [None]:
datasetFirst100KRows = datasetDefinition.take(100000)

In [None]:
datasetFirst100KRows.cache('./cache')

In [None]:
datasetFirst100KRows.row_count

In [None]:
datasetRandomTwenty.get_definition()

### Profile the Data

Now we can profile the data in our random subset - this will give us a more detailed look at the features in our dataset. 

> **Hints**:
>
> - Use the [get_profile](https://review.docs.microsoft.com/en-us/azure/machine-learning/service/how-to-explore-prepare-data?branch=release-build-amls#explore-with-summary-statistics) method of the Dataset object to view a profile of the data.
> - You can get the profile of the 100,000 row subset, or you can profile your snapshot.
> - The snapshot's profile was calculated during creation, so this may be faster.

In [None]:
datasetFirst100KRows.get_profile()

### Drop Sparse and Redundant Columns

Examine the profile of the data, and in particular take note of columns that contain a high **Empty Count** or **Missing Count**. If more than 20% of the values in a column are empty or missing (that is, more than 20,000 of our 100,000 rows), that's a good indication that the column is unlikely to be useful and can be dropped before we go any further.

Additionally, the statistics for the **Census_PrimaryDiskTotalCapacity** and **Census_SystemVolumeTotalCapacity** columns are virtually identical, indicating that these columns may have a high degree of overlap, making one of them redundant.

*To save you some time, the following code includes lists of columns that have empty values in 20% or more of cases, or are redundant - you must add the code to drop them.*

> **Hint**: Use the [drop_columns](https://docs.microsoft.com/en-us/python/api/azureml-dataprep/azureml.dataprep.dataflow?view=azure-dataprep-py#drop-columns-columns--multicolumnselection-----azureml-dataprep-api-dataflow-dataflow) method to drop the non-predictive columns. Then get the profile of the dataflow containing the remaining columns.
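
As a rough illustration of the 20% rule, the following plain-Python sketch computes the fraction of empty or missing values per column and flags the sparse ones. The column names come from the dataset, but the rows and the `missing_fraction` helper are made up for the example; in practice you would read these counts from the data profile.

```python
# Made-up toy rows; None and "" stand in for missing and empty values.
rows = [
    {"SmartScreen": None, "OsBuild": 17134, "PuaMode": None},
    {"SmartScreen": "RequireAdmin", "OsBuild": 17134, "PuaMode": None},
    {"SmartScreen": None, "OsBuild": 16299, "PuaMode": None},
    {"SmartScreen": "", "OsBuild": 17134, "PuaMode": "on"},
    {"SmartScreen": "Off", "OsBuild": None, "PuaMode": None},
]

def missing_fraction(rows, column):
    """Fraction of rows where the column is None or an empty string."""
    missing = sum(1 for row in rows if row[column] in (None, ""))
    return missing / len(rows)

threshold = 0.2  # drop columns with more than 20% empty or missing values
fractions = {col: missing_fraction(rows, col) for col in rows[0]}
sparse_cols = [col for col, frac in fractions.items() if frac > threshold]

print(fractions)
print(sparse_cols)
```

Here **OsBuild** sits exactly at the threshold and is kept, because the rule is "more than 20%".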

In [None]:
cols_to_drop = ["DefaultBrowsersIdentifier",
                "OrganizationIdentifier",
                "PuaMode",
                "SmartScreen",
                "Census_ProcessorClass",
                "Census_InternalBatteryType",
                "Census_IsFlightingInternal",
                "Census_ThresholdOptIn",
                "Census_IsWIMBootEnabled",
                "Census_SystemVolumeTotalCapacity"]

# Add code to drop these columns
datasetFirst100KRows = datasetFirst100KRows.drop_columns(cols_to_drop)

In [None]:
datasetFirst100KRows.get_profile()

### Handle Outliers, Nulls, and Data Errors

If you examine the profile of the data, you might also spot a few more problems that we need to resolve:

There are still some columns with empty or missing (or *null*) values. There are a variety of ways you can handle nulls, such as setting them to a specific value, interpolating values based on a sequence, substituting the mean or median value for numeric columns, or simply removing rows that contain any null columns. We'll keep things simple and replace null values with **0**.

The profile reports some errors in the **Census_PrimaryDiskTotalCapacity** column, so there are some invalid values in that column. Again, we'll keep things simple and replace the errors with **0**.

Additionally, look closely at the **Census_TotalPhysicalRAM** column. The amount of RAM in the machines ranges from 512 to what seems like a pretty high value for RAM. Now look at the **Mean** value, which should be much closer to the minimum than to the maximum. Looking at the various **quantile** values for this column, it appears that a few extremely high values in the 99.9th percentile account for the high maximum. These are likely to be outliers, i.e. atypically high values that might just be extremely rare, or might be indicative of a data entry error. A common way to deal with outliers is to choose minimum and maximum threshold values and set each outlier to the nearest threshold.

Let's deal with those issues:

> **Hints**:
>
> - Use the [replace_na](https://docs.microsoft.com/en-us/python/api/azureml-dataprep/azureml.dataprep.dataflow?view=azure-dataprep-py#fill-nulls-columns--multicolumnselection--fill-with--typing-any-----azureml-dataprep-api-dataflow-dataflow) function to replace empty values in all columns with null. To specify all columns, use a [ColumnSelector](https://docs.microsoft.com/en-us/python/api/azureml-dataprep/azureml.dataprep.columnselector?view=azure-dataprep-py). There's an example of using this class in the [Data Prep SDK Tutorial](https://docs.microsoft.com/en-gb/azure/machine-learning/service/tutorial-data-prep?toc=%2Fen-us%2Fpython%2Fapi%2Fdataprep_py_toc%2Ftoc.json%3Fview%3Dazure-dataprep-py&bc=%2Fen-us%2Fpython%2Fdataprep_py_breadcrumb%2Ftoc.json%3Fview%3Dazure-dataprep-py&view=azure-dataprep-py#cleanse-data).
> - Use the [fill_nulls](https://docs.microsoft.com/en-us/python/api/azureml-dataprep/azureml.dataprep.dataflow?view=azure-dataprep-py#fill-nulls-columns--multicolumnselection--fill-with--typing-any-----azureml-dataprep-api-dataflow-dataflow) method to replace null values in all columns with **0**.
> - Use the [fill_errors](https://docs.microsoft.com/en-us/python/api/azureml-dataprep/azureml.dataprep.dataflow?view=azure-dataprep-py#fill-errors-columns--multicolumnselection--fill-with--typing-any-----azureml-dataprep-api-dataflow-dataflow) method to set all **Census_PrimaryDiskTotalCapacity** error values to **0**.
> - Use the [clip](https://docs.microsoft.com/en-us/python/api/azureml-dataprep/azureml.dataprep.dataflow?view=azure-dataprep-py#clip-columns--multicolumnselection--lower--typing-union-float--nonetype----none--upper--typing-union-float--nonetype----none--use-values--bool---true-----azureml-dataprep-api-dataflow-dataflow) method to ensure that all **Census_TotalPhysicalRAM** values are between 0 and 16384 - all values outside of these thresholds should be set to the lower (0) or upper (16384) threshold value.
> - After making these changes, view the profile of the modified dataflow.
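
The null-filling and clipping steps are simple enough to sketch in plain Python. The `fill_nulls` and `clip` functions below are local stand-ins that mimic the behavior described in the hints, not the SDK methods themselves, and the RAM values are made up.

```python
def fill_nulls(values, fill_with):
    """Replace None with a fixed value."""
    return [fill_with if v is None else v for v in values]

def clip(values, lower, upper):
    """Pull values outside [lower, upper] back to the nearest threshold."""
    return [min(max(v, lower), upper) for v in values]

ram_mb = [512, 4096, None, 8192, 1572864, 16384]  # made-up sample values
cleaned = clip(fill_nulls(ram_mb, 0), lower=0, upper=16384)

print(cleaned)
```

The order matters in this sketch: nulls are filled first, because `None` cannot be compared against the numeric thresholds.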

In [None]:
from azureml.dataprep import ColumnSelector
column_selector = ColumnSelector(term=".*",
                                 use_regex=True)

datasetFirst100KRows = datasetFirst100KRows.replace_na(column_selector)
datasetFirst100KRows = datasetFirst100KRows.fill_nulls(column_selector, 0)
datasetFirst100KRows = datasetFirst100KRows.fill_errors('Census_PrimaryDiskTotalCapacity', 0)
datasetFirst100KRows = datasetFirst100KRows.clip('Census_TotalPhysicalRAM',0,16384)

In [None]:
datasetFirst100KRows.get_profile()

### Convert the Data to a Pandas Dataframe

Now that we've discarded the obviously unhelpful columns and cleaned up the remaining ones, it's time to dig a little deeper. Most data scientists have expertise in exploring data for feature selection and engineering, and many will use conventional Python libraries like **pandas** to do this, so start by converting the dataflow into a pandas **dataframe** object named **explore_df**.

> **Hints**:
>
> - IMPORTANT: The remainder of the cells below assume you name the dataframe **explore_df**.
> - Use the Dataset object's [to_pandas_dataframe](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py#to-pandas-dataframe--) method to create a dataframe from the data in the dataflow.
> - Use the pandas dataframe [head](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method to verify that you can see the first 10 rows of data in the dataframe.

In [None]:
df = datasetFirst100KRows.to_pandas_dataframe()
df.head(10)

Now we're ready to use pandas and other conventional Python libraries to explore the data. For expediency, and because the focus of this challenge is on the Azure ML Data Prep SDK, the code to explore the data has been provided - run the cells below, reading the notes so you follow what's going on.

### View Summary Statistics

First, let's view some summary statistics for the data.

In [None]:
explore_df = df

In [None]:
from IPython.display import display
import pandas as pd

# Ensure the output is not truncated
pd.options.display.max_columns = None

# Get summary stats for all columns

explore_df.describe(include="all")


From this, you can see that there are 100,000 unique **MachineIdentifier** values, so this is obviously a unique key field for the records - and therefore not likely to be predictive. Let's drop it from the dataframe. We'll also add it to the existing list of columns to be dropped from the dataflow (you'll see why later!):
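
The reasoning here (a column with one distinct value per row is a key, not a feature) can be expressed as a small check. The following plain-Python sketch uses made-up rows, and `identifier_like_columns` is a hypothetical helper written for this illustration.

```python
# Made-up toy rows using column names from the dataset
rows = [
    {"MachineIdentifier": "a1", "Platform": "windows10", "HasDetections": 1},
    {"MachineIdentifier": "b2", "Platform": "windows10", "HasDetections": 0},
    {"MachineIdentifier": "c3", "Platform": "windows8", "HasDetections": 0},
    {"MachineIdentifier": "d4", "Platform": "windows10", "HasDetections": 1},
]

def identifier_like_columns(rows):
    """Columns in which every row has a different value."""
    return [col for col in rows[0]
            if len({row[col] for row in rows}) == len(rows)]

print(identifier_like_columns(rows))
```

Treat the result as a prompt for inspection rather than a rule: a continuous numeric measure can also be distinct in every row and still be predictive.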

In [None]:
explore_df = explore_df.drop(columns=['MachineIdentifier'])
cols_to_drop.append('MachineIdentifier')

### Visualize Distribution of HasDetections vs !HasDetections

It's common to explore data by plotting visualizations - this can often make it easier to see patterns or trends in the data as you try to identify features that will help predict a label. In this case, we're trying to predict the label column **HasDetections**, so let's take a moment to check the distribution of this field:

In [None]:
%matplotlib inline 
import matplotlib.pyplot as plt

detection_counts = explore_df['HasDetections'].value_counts() # find the counts for each unique category

fig = plt.figure(figsize=(9,6))
ax = fig.gca()    
detection_counts.plot.bar(ax = ax, color=["green", "red"]) 
ax.set_title('Machines with Detections') 
ax.set_xlabel('Detections') 
ax.set_ylabel('Machines')
plt.show()

It looks like there's a fairly even split of cases where **HasDetections** is 1 (*true*) or 0 (*false*).

### Visualize Categorical Variables vs Label as Bar Charts

Some of our features are *categorical*; that is, they represent identifiers for categories rather than numeric amounts or measures. Let's compare the distribution of the label column by category to see if there are any obvious cases of categories correlating to label values.

In [None]:
%matplotlib inline 
import matplotlib.pyplot as plt


# Split the data into two dataframes - one for each label value
detections_df = explore_df[(explore_df.HasDetections==1)]
nondetections_df = explore_df[(explore_df.HasDetections==0)]

# Get the numeric features
num_cols = ["AVProductsInstalled",
            "AVProductsEnabled",
            "OsBuild",
            "Census_ProcessorCoreCount",
            "Census_InternalBatteryNumberOfCharges",
            "Census_OSBuildNumber",
            "Census_OSBuildRevision",
            "Census_PrimaryDiskTotalCapacity",
            "Census_TotalPhysicalRAM",
            "Census_InternalPrimaryDiagonalDisplaySizeInInches",
            "Census_InternalPrimaryDisplayResolutionHorizontal",
            "Census_InternalPrimaryDisplayResolutionVertical"]

# Get the categorical features
cat_cols = list(detections_df.columns)
non_cat_cols = num_cols.copy()
non_cat_cols.append("HasDetections")
for col in non_cat_cols:
    cat_cols.remove(col)

# Plot the frequency of each categorical column value for both datasets side-by-side
for column in cat_cols:
    pos = detections_df[column].value_counts()
    neg = nondetections_df[column].value_counts()
    # Don't try to plot features with over 100 distinct categorical values
    if len(pos) < 101:
        fig = plt.figure(figsize=(18, 6))
        fig.clf()
        ax1 = fig.add_subplot(1, 2, 1)
        ax0 = fig.add_subplot(1, 2, 2) 
        pos.plot.bar(ax = ax1, color="red")
        ax1.set_title("Positive by " + column)
        neg.plot.bar(ax = ax0, color="green")
        ax0.set_title("Negative by " + column)
plt.show()


### Visualize Numeric Variables vs Label as Box Plots

To compare the distribution of our numeric features to the label, we'll use box plots:

In [None]:
for col in num_cols:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    explore_df.boxplot(column = col, by = "HasDetections", ax = ax)
    ax.set_title(col)
    ax.set_ylabel(col)
plt.show()

### Use a Chi-Squared Test to Determine Useful Categorical Features

The plots show some potential correlations. For example, the box plots of **AVProductsInstalled** show a clear differentiation between the machines that have detections and those that don't. In most cases, though, the plots are fairly inconclusive, so let's try applying some statistical techniques.

A Chi<sup>2</sup> test is a good way to check for an association between categorical features and labels, so we'll use it to find columns where the probability of an apparent correlation arising by chance is very low (here, a p-value below 0.005). We'll treat these as important columns to keep; for the remaining columns the test found no significant evidence of a relationship with the label, so we'll add them to the list of columns to drop.
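
To show what the test actually computes, here is a worked example in plain Python on a made-up 2x2 contingency table. The counts and the feature are hypothetical. The closed-form p-value used here only holds for one degree of freedom, and SciPy's `chi2_contingency` applies a continuity correction to 2x2 tables by default, so its result for the same table would differ slightly.

```python
import math

# Made-up 2x2 contingency table: rows are HasDetections (0, 1),
# columns are two values of a hypothetical categorical feature.
observed = [[30, 20],
            [10, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count for each cell if the feature and the label were independent
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum of (observed - expected)^2 / expected over all cells
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

# With 1 degree of freedom the p-value has a closed form
p_value = math.erfc(math.sqrt(chi2 / 2))

print(chi2, p_value, p_value < 0.005)
```

The larger the gap between observed and expected counts, the larger the statistic and the smaller the p-value.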

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as stats
from scipy.stats import chi2_contingency

alpha = 0.005
Y = explore_df["HasDetections"].astype(str)
    
# Categorical feature Selection
for var in cat_cols:
    X = explore_df[var].astype(str)
    df_crosstab = pd.crosstab(Y,X)
    chi2, p, dof, expected = chi2_contingency(df_crosstab)
    if p < alpha:
        print("{0} is IMPORTANT".format(var))
    else:
        print("{0} is not important".format(var))
        cols_to_drop.append(var)

### Use ANOVA to Get Most Important Numeric Columns

For our numeric columns, we'll use the **SelectKBest** class from the **scikit-learn** library. By default it scores each feature with an analysis of variance (ANOVA) F-test, which measures the relationship between a numeric feature and a categorical label, and keeps the *k* highest-scoring features.
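
For intuition about what an ANOVA score measures, the following plain-Python sketch computes a one-way F statistic by hand: the variance between the label groups divided by the variance within them. The `anova_f` helper and the sample values are made up for the illustration and are not how scikit-learn implements its scoring.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the values around their own group mean
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up values of one numeric feature, split by label
no_detection = [1, 2, 2, 3]
detection = [2, 3, 3, 4]

print(anova_f([no_detection, detection]))  # groups differ, so F is above 0
print(anova_f([[1, 2, 3], [1, 2, 3]]))     # identical groups give F of 0
```

A feature whose values look the same for both label groups scores near zero, while a feature whose group means are far apart relative to the spread within each group scores high.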

In [None]:
from sklearn.feature_selection import SelectKBest

X = explore_df[num_cols].astype(float)
X.fillna(0, inplace=True)
y = explore_df["HasDetections"]

# Find the 4 most important numeric columns
X_new = SelectKBest(k=4).fit(X, y)

for i in range(len(num_cols)):
    if X_new.get_support()[i]:
        print("{0} is IMPORTANT".format(num_cols[i]))
    else:
        print("{0} is not important".format(num_cols[i]))
        cols_to_drop.append(num_cols[i])

### View Columns to be Dropped

So now we have a few columns that we've identified as non-predictive, let's review them:

In [None]:
print("Columns to be dropped\n----------------------------")
for col in sorted(cols_to_drop):
    print(col)

### Add a Few More

In a real project, you'd perform more analysis over several iterations. To keep things moving in this hack, we'll just add a few more columns that have been found to be unhelpful in predicting the label to our list of columns to be removed.

In [None]:
# get rid of some other columns just to make things easier in the hack!
more_columns = ['AVProductStatesIdentifier',
                'OsPlatformSubRelease',
                'OsSuite',
                'OsBuildLab',
                'SkuEdition',
                'SMode',
                'Census_OSVersion',
                'Census_OSBranch',
                'Census_OSEdition',
                'Census_OSSkuName',
                'Census_OSInstallTypeName',
                'Census_OSWUAutoUpdateOptionsName',
                'Census_ActivationChannel',
                'CountryIdentifier',
                'AvSigVersionEncoded',
                'Platform',
                'Processor',
                'Census_MDC2FormFactor',
                'Census_DeviceFamily',
                'Census_PrimaryDiskTypeName',
                'Census_OSArchitecture',
                'Census_GenuineStateName', 
                'Census_PowerPlatformRoleName',
                'AvSigVersion',
                'Census_ChassisTypeName'
               ]
for col in more_columns:
    cols_to_drop.append(col)
print("Columns to be dropped\n----------------------------")
for col in sorted(cols_to_drop):
    print(col)

### Drop the Columns from the Data Flow

After this brief diversion into pandas, we're ready to return to the dataflow from which we generated the pandas dataframe and pick up where we left off. Drop all of the columns we've decided aren't useful from this dataflow, and review the profile of the remaining data.

### Encode Categorical Columns

Many of our columns are categorical. Some (for example, **RtpStateBitfield**) use an integer number to represent a category, and others use a string value. Most machine learning algorithms expect all features, even categorical ones, to be represented by numeric values, so we need to encode the remaining string values as numerics.

There's a whole range of techniques that you can use to encode categorical variables, from simple [label encoding](https://github.com/Microsoft/AMLDataPrepDocs/blob/master/how-to-guides/label-encoder.ipynb) (in which each possible categorical value is mapped to a unique integer number) to [one-hot encoding](https://github.com/Microsoft/AMLDataPrepDocs/blob/master/how-to-guides/one-hot-encoder.ipynb) (in which a column is created for each possible categorical value and a binary *1* or *0* is used to indicate to which category each observation row belongs).

In this hack, take the simple option and use *label encoding* to encode the remaining string columns (**EngineVersion**, **AppVersion**, and **Census_FlightRing**).

> **Hints**:
>
> - Use the [label_encode](https://docs.microsoft.com/en-us/python/api/azureml-dataprep/azureml.dataprep.dataflow?view=azure-dataprep-py#label-encode-source-column--str--new-column-name--str-----azureml-dataprep-api-dataflow-dataflow) method to create an encoded version of each string column.
> - After creating an encoded version of each string column, use the [drop_columns](https://docs.microsoft.com/en-us/python/api/azureml-dataprep/azureml.dataprep.dataflow?view=azure-dataprep-py#drop-columns-columns--multicolumnselection-----azureml-dataprep-api-dataflow-dataflow) method to drop the original string columns.
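
The difference between the two encoding techniques is easiest to see on a tiny example. The plain-Python sketch below uses local `label_encode` and `one_hot_encode` helpers written for this illustration; they are not the SDK methods, the sample values are made up, and the SDK may assign integer codes in a different order.

```python
def label_encode(values):
    """Map each distinct value to an integer, in order of first appearance."""
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes)
        encoded.append(codes[v])
    return encoded, codes

def one_hot_encode(values):
    """Create one 0/1 column per distinct value (columns sorted by name)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories

flight_ring = ["Retail", "NOT_SET", "Retail", "WIF", "Retail"]  # made-up sample

labels, codes = label_encode(flight_ring)
one_hot, categories = one_hot_encode(flight_ring)

print(labels, codes)
print(categories, one_hot)
```

Label encoding keeps a single column but implies an ordering between categories that may not exist; one-hot encoding avoids that at the cost of one extra column per category.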

### Normalize Numeric Columns

Now that we've prepared our categorical columns for modeling, let's turn our attention to the numeric columns. A quick look at the profile shows a wide disparity in the possible range of values for our numeric features. For example, **AVProductsInstalled** ranges from 0 to 5, while **Census_TotalPhysicalRAM** ranges from 512 to 16384. In some algorithms, this disparity in scale might cause the larger numbered features to dominate the training of the model, so we should normalize the numerics to be on a similar scale, with the values in each column proportionally representing the original unscaled values.

There are various ways you can normalize numeric columns, with the optimal choice depending on how the data is distributed. We'll take an easy way out and just scale all of our numeric values between 0 and 1 based on the minimum and maximum values. 

> **Hints**:
>
> - Use the [min_max_scale](https://docs.microsoft.com/en-us/python/api/azureml-dataprep/azureml.dataprep.dataflow?view=azure-dataprep-py#min-max-scale-column--str--range-min--float---0--range-max--float---1--data-min--float---none--data-max--float---none-----azureml-dataprep-api-dataflow-dataflow) method to scale the numeric columns between 0 and 1.
> - You should scale only the columns that are *numeric* in the sense that they represent a measure or quantity - do not include integer columns that represent categorical values.
> - View the profile of the data after normalizing the numeric columns.
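
Min-max scaling itself is a one-line formula: subtract the column minimum and divide by the column range. The plain-Python sketch below shows it on made-up RAM values; the local `min_max_scale` function and its `range_min`/`range_max` parameter names are chosen for this illustration and should not be read as the SDK's implementation.

```python
def min_max_scale(values, range_min=0.0, range_max=1.0):
    """Linearly rescale values so the smallest maps to range_min and the largest to range_max."""
    data_min, data_max = min(values), max(values)
    if data_max == data_min:
        # A constant column has no range to scale; map everything to range_min
        return [range_min for _ in values]
    span = data_max - data_min
    return [range_min + (v - data_min) * (range_max - range_min) / span
            for v in values]

ram_mb = [512, 4096, 8192, 16384]  # made-up sample values
scaled = min_max_scale(ram_mb)

print(scaled)
```

Because the scaling depends on the minimum and maximum, it is sensitive to outliers, which is one reason to clip extreme values before normalizing.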

### Update the Dataset Definition to Include Data Preparation

We've now completed all the steps we need to select and prepare the features we want to use when training a model.

We can now update the definition of our Dataset to reflect all of these changes - that way should the source data be updated, the same data transformations can be easily applied again.

> **Hint**: Use the [update_definition](https://review.docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-dataset-definitions?branch=release-build-amls#update-dataset-definitions) method to update your registered dataset.

### Review the Dataset

Now that we have updated our dataset, we can pull it back down and review it.

In [None]:
from azureml.core import Datastore, Dataset, Workspace

ws = Workspace.from_config()

# Get the registered dataset by name
dataset = Dataset.get(ws, "<YOUR DATASET NAME>")

# View the first 100 rows
dataset.head(100)

### Optional: Save the data to CSV and upload to Datastore

You can output the Dataset to a pandas dataframe, and then save that as a CSV. Once it is in the CSV format, you can upload it to your Datastore, which may be useful if you are building custom models or not working exclusively within the AzureML workspace.

> **Hints**:
>
> - The [to_pandas_dataframe](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.dataset%28class%29?view=azure-ml-py#to-pandas-dataframe--) method may be of use here for prepping to output a CSV.