Feb 1, 2023 HMS Data Analysis Club
We'll start by going through this notebook together. Colab notebooks are online Python notebooks hosted by Google.
There are a number of ways to install Python. You can install a Python distribution directly from the Python Software Foundation. However, today we will instead be installing Miniconda. Miniconda is based on Anaconda, a Python distribution that includes many popular scientific packages but also takes up a lot of disk space. Miniconda is a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, such as pip and zlib. You can download and install Miniconda here.
We're going to use PyCharm as our development environment for Python. You're welcome to use another (Visual Studio Code, Spyder, Emacs, Vim, etc.), but we will mainly be supporting PyCharm. Make sure to select the Community Edition. You can download PyCharm here.
Once PyCharm is installed, you can use it to open the file scripts/ml-workflow.py.
If PyCharm asks you to create a new project, select yes.
Once inside PyCharm, open up the terminal. If you are on Windows, make sure to select the command prompt option under the ∨ menu:
This is the terminal emulator built into PyCharm, which should give everyone the same command-line interface.
If you want to know some more about Conda and Python environments, there's a great lesson here. Generally, an environment is a separate installation of Python with its own set of packages. This allows a user to have multiple versions of Python and various packages around at the same time. Conda is a tool for managing Python environments which allows you to create, alter, export, and import environments.
This is extremely useful for reproducible research. A researcher can export their Python environment using conda, which allows anyone trying to reproduce their work to install exactly the same packages and versions.
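To make "the exact same packages and versions" concrete, here is a minimal sketch that lists what is installed in whichever environment runs it. It uses only the standard library's importlib.metadata; the function name installed_versions is made up for this illustration. It is not a substitute for conda's own export, only a quick way to see the kind of information an exported environment records.

```python
from importlib.metadata import distributions


def installed_versions():
    """Map each installed distribution name to its version string.

    Hypothetical helper for illustration; it only sees Python packages,
    not everything conda itself tracks.
    """
    versions = {}
    for dist in distributions():
        meta = dist.metadata
        # Skip entries whose metadata is missing or incomplete.
        name = meta.get("Name") if meta is not None else None
        version = meta.get("Version") if meta is not None else None
        if name and version:
            versions[name] = version
    # Sort case-insensitively so the listing is easy to scan.
    return dict(sorted(versions.items(), key=lambda item: item[0].lower()))


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}=={version}")
```

Running this in two different environments should generally produce two different listings, which is the point of keeping environments separate.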
Before we begin, we need to make sure conda has had a chance to initialize.
In the terminal, run the command conda init.
Now, we can create a new environment.
Let's call it ml-env, since we'll be using it to run a machine learning workflow.
When creating an environment, we need to specify a name and we should specify which Python version we want to use.
conda create --name ml-env python=3.10
It might take a few minutes to install, but we should have a new environment. We can see the current list of environments with:
conda env list
While we can see our new environment, there is probably a star (*) next to the base environment.
This indicates that we are still in base; we are not yet 'in' the ml-env environment.
We need to use the conda activate command to enter ml-env.
conda activate ml-env
We should be able to list environments again and see that the star has changed.
Additionally, (ml-env) should appear at the start of each terminal prompt.
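If you would like to double-check from Python itself, the sketch below reports which interpreter is running. It assumes conda's activation has set the CONDA_DEFAULT_ENV environment variable, as it normally does; the function name describe_environment is made up for this example. You can paste it into a Python session started from the same terminal.

```python
import os
import sys


def describe_environment():
    """Collect a few facts about the Python interpreter that is currently running."""
    return {
        # Name of the active conda environment; None if conda has not set it.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Full path of the interpreter; inside ml-env it should point into
        # that environment's folder rather than base.
        "executable": sys.executable,
        # Major.minor version, to compare with the version requested above.
        "python_version": f"{sys.version_info.major}.{sys.version_info.minor}",
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If conda_env still shows base, or the executable path does not mention ml-env, the activation step probably did not take effect in this terminal.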
Now that we are in our environment, we need to install the right packages into it.
If we look at the import statements at the top of ml-workflow.py:
import pandas as pd
from sklearn import model_selection, ensemble, metrics
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
We can see the packages we want to use and should install into our newly created environment.
Packages are installed using the conda install command.
For instance, we install pandas with:
conda install pandas
Repeat this for each of the imported packages.
Note that the sklearn package, which is the most popular Python package for machine learning, is installed as scikit-learn.
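Before running the script, one way to check that nothing was missed is to ask Python whether each import can be found. The sketch below is an illustration only: the REQUIRED mapping and the missing_packages function are made up here, built from the import statements shown above, and map each import name to the name you would pass to conda install.

```python
from importlib.util import find_spec

# Import name used in ml-workflow.py -> package name to pass to `conda install`.
REQUIRED = {
    "pandas": "pandas",
    "sklearn": "scikit-learn",  # imported as sklearn, installed as scikit-learn
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
}


def missing_packages(required=REQUIRED):
    """Return the package names whose import name cannot be found."""
    return [package for module, package in required.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Still to install:", " ".join(missing))
    else:
        print("All packages imported by ml-workflow.py were found.")
```

This only checks that the packages can be located in the active environment; it does not import them or verify their versions.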
Finally, move into the scripts directory:
cd scripts
We should now be able to run our script. This script performs classification, using telomere sequence data to predict which pathway cancer cells used to achieve replicative immortality. The dataset is taken from this paper by Lee et al., 2018.
python ml-workflow.py
If everything worked, we should see some figures appear!
If you were able to get everything else working, try installing and running the code from the following paper's git repository: https://github.com/greenelab/wenda_gpu_paper.