Skip to content

Python-based machine learning and data science module from SFSU developed for the NIGMS Sandbox project

Notifications You must be signed in to change notification settings

NIGMS/Introduction-to-Data-Science-for-Biology

Repository files navigation

course card

San Francisco State University

Introduction to Machine Learning for COVID Predictions


This module teaches you how to create a simple Decision Tree using a structured dataset. In addition to the overview given in this README you will find the four Jupyter notebooks. The second notebook is optional.

  • 1- Intro to Machine Learning: Decision Trees: This notebook provides a basic introduction to Machine Learning concepts, steps for creating and understanding a Decision Tree model, making predictions with it, and intuitively evaluating its performance.

  • 2- (Optional) Quant. Comparison of 2020 DT Model Performance for (2020 vs 2021) Data: This notebook is optional, for students who would like to know a bit more about how to evaluate model performance quantitatively, and offers an introduction to why machine learning models require retraining from time to time.

  • 3- Practice: This notebook provides a way to practice and test what you have learned from the first notebook. It includes basic instructions outlining every step discussed in the first notebook. Students are free to either copy and modify the code from the first notebook or they can choose to write it themselves.

  • 4- Practice - Answer Key: This notebook provides the answers and explanation to the previous Practice exercise notebook. Check this notebook only after you have tried to complete the previous exercise yourself.

This module will cost you about $1.00 to run, assuming you tear down all resources upon completion.

Overview of Page Contents

Getting Started

Included is a tutorial in the form of Jupyter notebooks. The main purpose of the tutorial is to help beginners without much coding experience to familiarize themselves with basic fundamental concepts within machine learning using health data (COVID dataset). It is also meant to be extended to other kinds of structured data. The tutorial walks through step by step the process of creating a Decision Tree and interpreting it. This module intends to provide an intuitive understanding of how machine learning model performance is evaluated. In order to get to this module from the Google Cloud Platform, you will need to have access to a Google Cloud Platform account, this module is located within Vertex AI Workbench. For more technical information about Google Cloud Platform please click on the following link: NIH Cloud Lab README

1) Please click on the link for steps to open your GCP project: How to open your GCP Project.

2) Follow the steps highlighted here to create a new user-managed notebook in Vertex AI. Follow steps 1-8 and be especially careful to enable idle shutdown as highlighted in step 7. For this module you should select Debian 11 and Python 3 in the Environment tab in step 5. In step 6 in the Machine type tab, select n1-standard-4 from the dropdown box.

3) Now you will need to download the tutorial files from GitHub. The easiest way to do this would be to clone the repository from NIGMS into your Vertex AI notebook. This can be done by using the Git menu in JupyterLab, and selecting the clone option. To clone this repository, use the Git command git clone https://github.com/NIGMS/Introduction-to-Data-Science-for-Biology.git in the dropdown menu option in Jupyter notebook. Please make sure you only enter the link for the repository that you want to clone. There are other bioinformatics related learning modules available in the NIGMS Repository. This will download our tutorial files into a folder called Introduction-to-Data-Science-for-Biology.

IMPORTANT NOTE

Make sure that after you are done with the module, close the tab that appeared when you clicked OPEN JUPYTERLAB, then check the box next to the name of the notebook you created in step 3. Then click on STOP at the top of the Workbench menu. Wait and make sure that the icon next to your notebook is grayed out.

Overview

This module is geared towards beginners and does not require prior knowledge on a specific scientific discipline. The module is divided into three Jupyter notebooks as outlined at the beginning of this document. In addition to the notebooks mentioned, there are videos containing brief explanations about basic concepts in machine learning and what the code does in each step of the notebook. Below is an outline of the videos contained in each notebook with their respective links. These videos are already attached to the notebook.

1- Introduction To Machine Learning: Decision Trees (10 video clips)

2- (Optional) Quant. Comparison of 2020 DT Model Performance for (2020 vs 2021) Data

3- Practice Exercise ( 1 video clip)

4- Practice Exercise - Answer Key (1 video clip )

Software Requirements

Software requirements are satisfied by using a pre-made Google Cloud Platform environment Workbench Notebook. The notebook environment used is named "Python 3 with Intel® MKL" ; and it is listed during Step 3 for accessing our module. In addition all package requirements are installed by following the instructions Step 1 of the notebook "Intro to Machine Learning Decision Trees".

Workflow Diagrams

Submodule 1 and Submodule 3 will download CSV files stored in a Google Cloud Storage bucket to the Workbench notebook, then it will output additional CSV files that will be used optionally if students want to work on the (optional) Submodule 2. Below is a diagram that illustrates our workflow:

Architecture-diagram.PNG

Data

All original data from this module was originally sourced from the following sites:

We subsequently picked only certain variables of interest, cleaned and created a composite dataset for the years 2020 and 2021 from the sources listed above. We manipulated the variable named "Unemployment_rate" by using the 2020 rates in both the 2020 and 2021 Datasets. We then separated these datasets into training, validation, and testing sets for each of these years to streamline the tutorials. Finally, we stored them in our group's SFSU GitHub repository.

Funding

  • SFSU/UCSF M.S. Bridges to the Doctorate Program: cloud-based learning modules supplement (T32GM142515)
  • Demystifying Machine Learning and Best Data Practices Workshop Series for Underrepresented STEM Undergraduate and MS Researchers bound for PhD Training Programs (T34-GM008574)
  • The creation of this training module was supported by the National Institute Of General Medical Sciences of the National Institutes of Health under Award Number 3T32GM142515-01S1

License for Data

Text and materials are licensed under a Creative Commons CC-BY-NC-SA license. The license allows you to copy, remix and redistribute any of our publicly available materials, under the condition that you attribute the work (details in the license) and do not make profits from it. More information is available here.

Creative commons license

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

About

Python-based machine learning and data science module from SFSU developed for the NIGMS Sandbox project

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published