# DSX Local Hands On Lab - Version 1.0.1
## IBM Data Science Elite Team 
### This document is available electronically at the following links:
1. https://ibm.co/2mwoMK6
2. https://github.com/mwalli/DSXLAB
3. https://dataplatform.ibm.com/analytics/notebooks/v2/61cd7b67-0776-401c-b859-a7f8fb049b05/view?access_token=dc3ec7063543479f110c6801421c35a82cabe8a1cb63dcd1840705d33d168874

# 1. Preparation steps
## 1.1. __Make note of the URL and credentials provided to your team for accessing the DSX Local system__
The systems you'll be using for this lab are hosted in a cloud environment called Skytap.  This Skytap environment contains both a DSX Local system and a Hortonworks Data Platform (HDP) environment, including the Ambari systems management tool.  Your team will be assigned to an environment that is shared with 4 other teams in the hands-on lab.  Please pay close attention to the naming of projects, models, and other resources so that your team can easily identify its own assets.

>__Web browser__:  Use the __Chrome or Firefox__ browser on your personal workstation to complete these lab exercises.

> __Certificate warnings__:  When you first connect to the DSX Local application, you will receive an "insecure connection" (Firefox) or "connection not private" (Chrome) warning because an untrusted certificate was used during the DSX Local installation.  You can safely ignore these warnings and proceed to the site: Firefox users click "Advanced" and then "Add Exception"; Chrome users click "Advanced" and then "Proceed to URL".

The lab instructors will provide you with the following information.  Make note of:

1. DSX Local URL:
2. DSX Local username/password:
3. Ambari URL:
4. Ambari username/password:

> NOTE:  **Use only the URLs and login information provided to your team by the lab instructors.**  These lab exercises will not function properly if you use a URL or login information already in use by another team.  Do not share your URLs and login information with other teams.

        
## 1.2. Review data assets in Ambari
| __Filename in HDP__ (in /data/financialservices/churn/) | __Used For Which Lab Steps__                               | __Data Description__                                                               |
|---------------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------------------------------|
| __churn_rate_visualization.csv__                        | Python Notebook: *Churn Visualization Python HDP*          | Churn rates for Python visualization notebook                                                   |
| __cust_summary_visualization.csv__                      | Python Notebook:  *Churn Visualization Python HDP*         | Summarized customer data for Python visualization notebook    |
| __cust_summary_notebook_training.csv__                  | Scala Notebook:  *Churn ML Training Notebook Scala HDP LR* | Customer churn data for programmatic model training from within a Scala notebook |
| __cust_summary_visbuilder_training.csv__                | Used for training ML model with DSX Visual Model Builder   | Summarized customer churn data for training model with the visual model builder    |

Use the Ambari file browser to familiarize yourself with the data contained within HDP HDFS:

1. Log in to Ambari using the credentials you noted above.
2. Navigate to the "Files view" tool (available from the toolbar within the "Views" dropdown menu to the right of the "Admin" menu).
3. Verify that you see the assets listed in the table above in the file browser.  You will create data assets in DSX that correspond to these files.

## 1.3. Download copies of required Jupyter notebooks to your personal system
In this step you will open an existing project in DSX Local that is shared with your user ID.  You will then open and download Jupyter notebooks to your local system so you can reuse them in the following steps.
1. Log in to DSX Local as your assigned team user.
2. You will see a shared project called "BankChurn".  *Note:  Do not perform your lab exercises in this project; you only have "viewer" privileges for it.  You will open and then download the notebooks from this project.*
3. Click into the BankChurn project, then click "Assets".  You should see a list of three notebooks contained in the project.
4. Download both the __Churn Visualization Python HDP__ and __Churn ML Training Notebook Scala HDP LR__ notebooks.  To do this, first open the notebook by clicking its link, then download it using the `File -> Download as -> Notebook (.ipynb)` menu option.
    >*Note: Remember the download location for these files; you will need them in a later step.*
5. After downloading each .ipynb file, exit the Jupyter notebook by clicking on the "BankChurn" project name in the link bar.
6. To conserve system resources, stop the kernel that was launched when you opened each notebook (shown as a green circle) by selecting "Stop Kernel" from the 3-dot menu to the right of the notebook name in the assets list.

# 2. Create a new DSXL project, create data source connection to HDP and create associated data assets
Perform the following steps to prepare your DSX Local environment for the lab exercises.
## 2.1 Create a new DSX project
1. Return to your "All Projects" list either by clicking "View all Projects" from the 3 horizontal bar menu at the top-left corner of the screen, or by clicking "Projects" in the navigation link area of the screen.
2. Click "New Project" and in the "Name" field enter "__[TeamLetter]__Team_BankChurnLab" (e.g. ATeam_BankChurnLab, BTeam_BankChurnLab, etc)
>Note: As you progress through the lab exercises, you will see an asterisk \* next to the project's name, and the message *Changes made -- You have local changes that you can commit* will be displayed.  DSXL uses git internally to manage project changes.  You can commit changes if you would like to, but it is not necessary for the exercises in this lab.  If you had collaborators on the project, they would not see any additions or changes until you committed them.

## 2.2 Create a new data source connection to HDP and add associated data sets to the project.
Your DSX Local system is already configured to connect securely to the HDP HDFS system through the Knox gateway, but each project must contain a "Data source" with the correct connection information for HDFS.  Once the data source is created, you will add "data sets" that use it.
1. From within your new project, click "Data Sources" 
2. Click "add data source"
3. Enter the following information exactly as shown below then click the "Create" button at the lower right of the screen:
    * Data source name (this can be any text): HDP HDFS
    * Data source type (dropdown): HDFS - HDP
    * HDFS host: hdp1.atat.ibm.com
    * HDFS port:  8020
    * WebHDFS URL: https://hdp1.atat.ibm.com:8443/gateway/dsx/webhdfs/v1
    >*When you click "Create", the new data source name will be displayed*
4. Click the name of the new data source to enter the "View/Edit data source" screen
5. Scroll to the bottom of the screen and click "Add data set"
6. Click "Browse" - a window opens showing the contents of HDP HDFS, the same content you previewed earlier
>*If you don't see a list of files open up, check your data source to be sure you correctly entered the settings listed above*
7. Navigate to "/data/financialservices/churn/" in the file tree
8. For each of the CSV files listed in the table in step 1.2 above, do the following:
    * Select the file and select "Open"
    * For "Remote data set name", copy and paste the filename but exclude the csv file extension (churn_rate_visualization, cust_summary_visualization, cust_summary_notebook_training, cust_summary_visbuilder_training)
    * Click "Create", then scroll down to click "Add data set" again, then "Browse", and repeat for the remaining files.
    * When finished, click Save
9. Verify the new data sets are accessible
    * Click the name of your project to return to the main project screen
    * Click "Assets" then click "Data Sets"
    * Select "Preview" from the 3-dot menu to the right of each of the data sets.  Preview data should be displayed.
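
For readers who want to relate the data source settings to the files they point at, the sketch below builds the WebHDFS `LISTSTATUS` request URL for the churn directory and derives the remote data set names used in step 8.  It is a minimal sketch and not part of the lab: the helper names `liststatus_url` and `data_set_name` are hypothetical, and it assumes the gateway exposes the standard WebHDFS REST API at the WebHDFS URL entered above.  Whether your lab credentials are accepted for direct REST calls is not stated in this lab, so the code only constructs the URL and does not send a request.

```python
import posixpath

# Values copied from the data source settings in step 3 and the table in step 1.2.
WEBHDFS_URL = "https://hdp1.atat.ibm.com:8443/gateway/dsx/webhdfs/v1"
CHURN_DIR = "/data/financialservices/churn/"
CSV_FILES = [
    "churn_rate_visualization.csv",
    "cust_summary_visualization.csv",
    "cust_summary_notebook_training.csv",
    "cust_summary_visbuilder_training.csv",
]


def liststatus_url(base_url, hdfs_dir):
    """Build the WebHDFS URL that lists the contents of an HDFS directory."""
    return base_url.rstrip("/") + "/" + hdfs_dir.strip("/") + "?op=LISTSTATUS"


def data_set_name(filename):
    """Drop the file extension, as step 8 asks for the remote data set name."""
    root, _extension = posixpath.splitext(filename)
    return root


print(liststatus_url(WEBHDFS_URL, CHURN_DIR))
for csv_file in CSV_FILES:
    # Full HDFS path on the left, DSX remote data set name on the right.
    print(posixpath.join(CHURN_DIR, csv_file), "->", data_set_name(csv_file))
```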

# 3. Create and run the Python customer churn visualization notebook
In this lab section, you will create a new notebook from a file downloaded earlier, modify it to use the correct data sets from HDP HDFS, and run the notebook.
1. From within your project, either click "Create notebook" from the top right + menu, or click "Add notebook" from the assets list screen.
2. Select "From File"
3. At the bottom of the screen click "browse", select the __Churn+Visualization+Python+HDP.ipynb__ file (the plus signs were added when you downloaded the file), and click "Open".   The "Name" field is automatically filled in.  (*You can remove the plus signs from the name if you would like to but it is not necessary*)
4. Click "Create".  A "Launching Jupyter" message will be displayed.  __This message may last for a couple minutes, please be patient__
5. Once the notebook opens, you will see a series of cells.  The notebook has not yet been executed and must be modified before running.  Perform the following steps in order:
    1. Run the first cell.  To do this, click inside the first cell at the top of the notebook and then click the "Run" icon in the toolbar (__>|__).  Within the brackets to the left of the cell, you should see an \* appear and then change to the number "1".  This indicates the cell was run successfully.
    2. Now click in the cell below, where you will see a "TODO" comment.  Click just below the "TODO" comment, so the cursor is on the blank line below.  
    3. Now click the "*1001*" "Find data" icon above on the right in the toolbar.  The "Find data" menu opens.  Click on "Remote" and you will see a list of data sets.
    4. Underneath the "churn_rate_visualization" data set, click the dropdown arrow and select "__Insert Pandas DataFrame__".  *(If you don't see this option, see the note below.)*  Code to load the data set from HDP HDFS will automatically be inserted into the cell where the cursor was positioned.  Now close the "Find data" menu by clicking the "X" on the left side of the menu.
    > NOTE:  There is a possible bug here that you may need to work around.  If you see *Insert Spark DataFrame in R* in the dropdown (instead of the Python options), close the notebook by clicking on the project name, stop the notebook's kernel by selecting "Stop Kernel" from the 3-dot menu to the right of the notebook, then reopen the notebook and restart from step 1 above (rerun cell 1).
    5. After the pandas code is automatically inserted, run the cell by clicking the "Run" icon in the toolbar (__>|__).  You should see the tabular output of the dataframe.head() statement displayed in the cell output.  
    6. Click in the next cell down, find the "TODO" marker.  Follow the instruction, then run the cell.  You should see a Brunel visualization of the churn rate data.
    7. Click in the next cell down, find the "TODO" marker.  Click on the blank line below the comment and insert the "customer summary visualization" dataset as a pandas dataframe by using the "Find data" menu as before.  After the pandas code is automatically inserted, run the cell.  You should see a tabular representation of the dataframe.head() statement in the cell output area.
    8. Click in the next cell down, find the "TODO" marker.  Follow the instructions, then run the cell.  Tabular output should show customer mean income grouped by state.
    9. Click in the next cell down, a markdown cell labeled "Income by state".  While this cell is selected, go to the "Cell" dropdown menu and select "Run All Below" (this runs the selected cell and all cells below it).
    10.  Scroll down through the notebook to ensure that all remaining visualizations of the customer churn data ran and are displayed.
6. When you have completed running the notebook successfully, save the notebook by selecting `File -> Save and Checkpoint`.  The message "Checkpoint created" will appear in the toolbar.
7.  Close the notebook by clicking on your project's name.
8. Before continuing, __Stop the Python Kernel__ that was started for your notebook by selecting "Stop Kernel" from the 3-dot menu to the right of the notebook name.
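
The notebook step that shows mean customer income grouped by state is a plain group-and-average computation.  The sketch below illustrates it on a few made-up rows using only the Python standard library.  The column names `STATE` and `INCOME` and the sample values are assumptions for illustration; the actual columns in the customer summary data set may be named differently, and the notebook itself uses pandas rather than this code.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up sample rows; the real data set lives in HDP HDFS.
SAMPLE_CSV = """STATE,INCOME
CA,50000
CA,70000
NY,40000
TX,30000
TX,50000
"""


def mean_income_by_state(csv_text):
    """Group rows by state and return the mean income for each state."""
    incomes = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        incomes[row["STATE"]].append(float(row["INCOME"]))
    # Sort by state so the output order is stable.
    return {state: mean(values) for state, values in sorted(incomes.items())}


result = mean_income_by_state(SAMPLE_CSV)
for state, average in result.items():
    print(state, average)
```

In pandas, the same grouping is typically written as `df.groupby("STATE")["INCOME"].mean()`, again assuming those column names.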

# 4. Create and run the Scala customer churn notebook 
In this lab section, you will create a new notebook from a file downloaded earlier, modify it to use the correct data set from HDP HDFS, rename the published and deployed models it creates, and run the notebook.

1. From within your project, either click "Create notebook" from the top right + menu, or click "Add notebook" from the assets list screen.
2. Select "From File"
3. At the bottom of the screen click "browse", select the downloaded __Churn ML Training Notebook Scala HDP LR__ notebook file (the spaces in its filename were replaced with plus signs when you downloaded it), and click "Open".   The "Name" field is automatically filled in.  (*You can remove the plus signs from the name if you would like to but it is not necessary*)
4. Click "Create".  A "Launching Jupyter" message will be displayed.  __This message may last for a couple minutes, please be patient__
5. Once the notebook opens, you will see a series of cells.  The notebook has not yet been executed and must be modified.  Perform the following steps in order:
    1. Run the first 3 code cells.  To do this, click inside the first cell at the top of the notebook and then click the "Run" icon in the toolbar (__>|__).  Within the brackets to the left of the cell, you should see an \* appear and then change to the number "1".  This indicates the cell was run successfully.  Repeat for the next 2 code cells until you reach the first "TODO" marker.
    > NOTE:  Cell 1 in this notebook should only be run one time (refer to the notes in the notebook).  If you do need to rerun this cell, you will need to restart the notebook's kernel.
    2. In the cell below the "Loading data" label, read and follow the instructions in the "TODO" comments.  Use the "Find data" menu as before to insert the remote data set, this time as a Spark DataFrame.  When inserting the dataset code, be sure to click just below the comments, so the cursor is on the blank line below.  __(be sure to rename "sc" to "scl")__  Once you have the code inserted and modified, run the cell.  You should see tabular output from the dataframe.show() statement.
    3. Click in the next cell down, find the "TODO" marker, follow the instruction, then run the cell.  You should see tabular output from the churndata.show(5) statement.
    4. Continue running cells, one at a time, stopping when you have run the cell that renders a Brunel visualization of the model's ROC curve.
    5. In the cell below the "Publish Locally" label, find the "TODO" marker and follow the instructions.  Run the cell.
    > NOTE:  When you run this cell you will see 3 warnings from "SLF4J" which you can safely ignore.
    6. Click in the next cell down, below the "Deploy locally" label and find the "TODO" label.  Follow the instructions and run the cell.  You should see output that includes the HttpResponse received from the deployment service.
    7. Run the remaining 3 cells to invoke the model using the scoring service.   You will receive an HttpResponse that includes the results from invoking the model with the values supplied.
    > Note:  The remaining cells in the notebook are optional and require a Bluemix account, the Bluemix WML service, and associated WML credentials.
    
6. When you have completed running the notebook successfully, save the notebook by selecting `File -> Save and Checkpoint`.  The message "Checkpoint created" will appear in the toolbar.
7.  Close the notebook by clicking on your project's name.

# 5. Train, Publish, and Deploy an ML model using the DSX Local Visual Model Builder

In this lab section, you will use the DSXL Visual Model Builder GUI to train, publish, and deploy an ML model using a data file from HDP HDFS.

1. From within your project, click "Assets", then either scroll down to "Models" or click the "Models" link along the top.
2. Click "Add model" next to the __+__ sign
3. In the "Name" field, enter "*TEAMNAME* Visual Churn Model" *(replacing TEAMNAME with your team's name - ateam, bteam, etc)*
4. For the "Method" selection, click "__Manual__", then click "Create"
5. On the "Select data asset" screen, click the link for the "__cust_summary_visbuilder_training__" remote data set, a preview of the data set will be displayed.  Click "__Use this data__".  *(alternatively, you can simply click the radio button next to the data set and click next to skip the data preview)*
6. On the "Prepare data set" screen click "Add a transformer" to review the list of available transformers.  For this exercise, we only require the "Auto Data Preparation" transformer that is selected by default.  After reviewing the list of transformers, click Next.
7. On the "Select a technique" screen, do the following:
    1. In the "Column value to predict" dropdown, select the "Churn" column. (The goal of the model is to predict whether or not a customer will churn)
    2. Click "Binary Classification" as the technique.  (It should already be selected as the suggested technique.  )
    3. Accept the default split shown in the sliders for Train/Test/Holdout 
    4. Click "Add Estimators"
    5. Select "Logistic Regression" and click "Add"  
        > Note: Choose **only one** Estimator (Logistic Regression) for this exercise
    6. Click Next
8. A "Training models" status message will appear - this can take some time, be patient.
9. When model training has completed, review the results, select the radio button next to "Logistic Regression",  click "Save", and click "Save" again in the confirmation dialog.  
    > A "Saving model" status message will appear, then you will be returned to the assets list in the project. 
10. Review the option to publish the model to the Bluemix WML service.
    * Do this by clicking the 3-dot menu to the right of your new ML model and selecting "Publish Model".  Click "Cancel" after reviewing the dialog options.
    > If you were publishing to the WML service, you would paste the username/password credentials (long alphanumeric GUIDs) into this dialog.  The model would be published (saved) to your WML service within the IBM cloud.  
    
11.  Deploy the model to the DSX Local ML service
    1. Select "Deploy" from the 3-dot menu to the right of your saved model
    2. In the "Create Deployment" dialog box, name your deployment "__Deployed *TEAMNAME* Visual Churn Model__" (using your team's name)
    3. Select "Online" from the "Type" dropdown menu and click "Create".  
    4. Wait for the deployment process to complete.  
        > When the deployment is complete, you will be taken to the "Deployments" section of the DSX Local Model Management UI.
    
12. Test the deployed model using the Test API feature in DSX Local
    1. From the "Deployments" list in the Model Management UI, Click the deployment name for the model you created with Visual Modeler.
    2. Review the details for the deployment, then click "Test API"
    3. On the Test API screen, keep the default test values and click "Predict".  The model's prediction of whether this customer will churn appears, along with a pie chart representation of the probabilities.
        > Increase NEGTWEETS to 10 and click "Predict" again.  Note how the predicted value for this customer to churn changed.  
13. When finished, click "Close" to exit the Test API screen
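
The prediction changes when NEGTWEETS increases because of how logistic regression scores a record: it computes a weighted sum of the input values and passes it through the logistic (sigmoid) function to obtain a probability.  The sketch below illustrates this with made-up numbers.  The weights, the bias, and the second feature name `AGE` are hypothetical and are not taken from the model you trained; the sketch also assumes NEGTWEETS carries a positive weight, which this lab does not state.

```python
import math


def churn_probability(features, weights, bias):
    """Return the logistic-regression probability for one record."""
    # Weighted sum of the inputs plus the bias term.
    z = bias + sum(weights[name] * value for name, value in features.items())
    # The logistic (sigmoid) function maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, chosen only to illustrate the effect.
WEIGHTS = {"NEGTWEETS": 0.3, "AGE": -0.02}
BIAS = -1.0

low = churn_probability({"NEGTWEETS": 1, "AGE": 40}, WEIGHTS, BIAS)
high = churn_probability({"NEGTWEETS": 10, "AGE": 40}, WEIGHTS, BIAS)

print("NEGTWEETS=1 :", round(low, 3))
print("NEGTWEETS=10:", round(high, 3))
```

With a positive weight on NEGTWEETS, raising its value pushes the weighted sum up and the predicted churn probability toward 1; a negative weight would do the opposite.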

# 6. Use Model Management Features to schedule evaluation of deployed ML model

In this lab section, you will use the model management features in DSXL to schedule periodic evaluation of a deployed ML model.

1. From the 3-Horizontal Line menu (aka "Hamburger" menu at the top left of DSXL UI), select "Model Management"
2. Click the Deployments link to see all deployments
3. Click the deployment of the model that __your team__ created with the visual modeler
4. Scroll to the bottom of the deployment details screen and click "Schedule Evaluation"
5. On the "Schedule Evaluation" screen, do the following
    1. Choose "BinaryClassiferEvaluator" from the Evaluator dropdown menu
    2. Check the "Use performance metrics to monitor this model" checkbox
    3. Keep the radio button for "areaUnderROC" selected and accept the default of .7 for "Notify when less than"
    4. In the "Schedule" section, click on the "Starts at" selection and slide both sliders all the way to the left (this will schedule evaluation for as soon as possible)
    5. Select "Every Day" as the Repeat option
    6. In the "Remote Data Sets" section, select "cust_summary_visbuilder_training" as the evaluation data set and then click "__Schedule__"
    > Note:  Normally the evaluation data set would contain updated data.  In this case we are evaluating the model using the same data that was used for training.
6. Within a few minutes (shortly after the scheduled evaluation time) you will see the result of the model evaluation displayed in the Model Management UI.  
    > The completed deployment evaluation should show a green checkmark (indicating success) on the Dashboard tab of the Model Management UI.  The list of all deployment evaluations for a deployed model is visible at the bottom of the deployment details window. 
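
The `areaUnderROC` metric selected above can be read as the probability that the model scores a randomly chosen churner higher than a randomly chosen non-churner.  The sketch below computes it directly from that definition for a handful of made-up labels and scores, then applies the same "notify when less than .7" rule.  The function names and sample data are hypothetical, and this is not necessarily how DSX Local computes the metric internally.

```python
def area_under_roc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for churn, 0 for no churn.  scores: predicted churn probabilities.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # churner ranked above non-churner
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


def needs_notification(auc, threshold=0.7):
    """Mirror the 'Notify when less than' setting from the schedule screen."""
    return auc < threshold


# Made-up evaluation data for illustration only.
LABELS = [1, 1, 1, 0, 0, 0]
SCORES = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

auc = area_under_roc(LABELS, SCORES)
print("areaUnderROC:", round(auc, 3), "notify:", needs_notification(auc))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why a threshold such as .7 is a reasonable alarm level for a model expected to do clearly better than chance.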

****
### *This DSX Local Hands On Lab and the associated Skytap environment were created by the IBM Data Science Elite Team*
