# 1. Python Installation and Reproducible Workflow

AUTHOR

Python Group

🙌 Welcome to the first session of the Introduction to Python Workshop series. This workshop is intended as a starting guide for those who want to learn Python as a practical tool for data manipulation, analysis, and computation.

Here's a breakdown of the workshop structure:  

1. *Python Installation and Reproducible Workflow*
2. *Python Basics: Data Structure*
3. *Data Manipulation and Analysis with Pandas*
4. *Object-Oriented Programming and Machine Learning Libraries*


The goal of the workshop is to give people a flavor of data science in Python. We will start off with the very basics, i.e., installations (**session 1**). There are many ways to install Python and many choices of code editor (too many in our opinion!). For the scope of this workshop, we will show one way that we think is intuitive and efficient and that offers a flexible and reproducible workflow, particularly from the perspective of R users and statisticians.

Moving on to programming, we will talk about the fundamental concepts of the Python language such as data types and data structures (**sessions 2 & 3**), which will help you understand how similarly or differently Python operates compared to R.

Then, we will introduce the concept of classes/objects--the building blocks of Python object-oriented programming (OOP) and almost all Python libraries--and learn how to create and use them by following along with coding exercises in **session 4**.

*We assume you have basic programming knowledge, including concepts of libraries and functions, overall workflow of analytical projects, and basic command line skills.*

## Session 1: Toward building a reproducible analytical project in Python 

We will start with the very basics, i.e., installations, and walk through the standard practices of reproducible workflows when working with Python.

Due to time constraints, we have put the step-by-step installation instructions in this [installation follow-along guide](https://github.mskcc.org/pages/Python-Workshop/Python-Workshop.github.io/session1/session1.html). We will go over the essential components of an **interactive programming ecosystem** for Python programming.

In the second half of this session, we will walk through **steps of building a reproducible Python data analysis project**. We will create a GitHub repository to track project version updates and use virtual environments for reproducibility. 


### Objectives

- Have Python and conda installed locally.

- Set up Visual Studio Code (VS Code) and install extensions to let it recognize your local Python installations and conda virtual environments.

- Clone the workshop project folder from GitHub (requires logging in to MSK Enterprise) and open it in VS Code.

- Create a conda virtual environment for the workshop project, activate it, and create a data folder and Python script from within the VS Code workspace.


## Before we start...

### Requirements

There is no restriction on the operating system for the series. You can work on either a Windows or a macOS system. Below is a list of the required software for the workshop. You should have it installed on your computer before this session.

- **Miniconda**: [Download the Installer](https://www.anaconda.com/download/success).

- **VS Code**: [Download the Installer](https://code.visualstudio.com/download).

- **Quarto**: [Install from the official website](https://quarto.org/docs/get-started/). (*Do we still want it or stick to jupyter notebook in VS code since people have experienced bugs with the local installation of Quarto??*).

👉 We recommend you read our [Python Installation article](https://github.mskcc.org/pages/Python-Workshop/Python-Workshop.github.io/session1/session1.html) for a more detailed installation guide.


:::{.callout-tip}

We recommend that you install software and run through the tutorial on your MSK laptop or workstation.
Using VDI is generally slow for downloading/installing large software files.

:::

## Software Bundle: Python, Miniconda, and VS Code 

### Python (interpreter)

Python is a high-level programming language, just like R. It supports a wide range of tasks, including data manipulation and analysis, machine learning and artificial intelligence, and software development.

### Miniconda (Python distribution + package manager)

Miniconda provides a lightweight way to manage Python environments, similar to renv in R, ensuring dependency isolation without cluttering the global system. It simplifies package management, much like CRAN + Bioconductor, while also handling system-level dependencies. VS Code, akin to RStudio, offers a powerful yet customizable IDE with Python-specific extensions, an interactive Jupyter notebook interface, and seamless Git integration. Together, Miniconda and VS Code provide an efficient, modular setup, making Python adoption intuitive while maintaining the reproducibility and project structure familiar to R users.

### VS Code (IDE)

VS Code comes with integrated features such as a multi-panel layout for interactive coding, a terminal, Git/GitHub connection, and SSH connection. Many more add-on features are available to download as extensions.

## Virtual Environments

### What are virtual environments? Why do we need them?

Similar to how `{renv}` works as a "time capsule" for R projects, a virtual environment captures the Python version and the packages that a project depends on. This makes it possible to share an analysis and reproduce its results.

**Refresher: How to activate a conda environment**


### The best practice for working with conda virtual environments

We recommend that you create a new conda environment whenever you start a new Python project. It keeps the Python version and dependencies separate from other projects, thereby avoiding potential conflicts.

To do this, first **import or manually create an environment `.yml` file**. In the file, specify the Python interpreter version and list all the required conda channels and dependencies--an example is as follows:

```{code}
name: myproj-python310
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.10
  - pandas
  - numpy
```

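Then, create the environment from this file with `conda env create -f environment.yml`, and activate it whenever you work on your Python project.

From the terminal, type (here for an environment named `python-intro-env`):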

```{bash}
conda activate python-intro-env
```


When in VS Code, you might get a popup message like the one below, confirming that the environment was activated:

*Selected conda environment was successfully activated, even though "(ENVNAME)" indicator may not be present in the terminal prompt.*

Alternatively, in Anaconda Navigator, click on the **Environments** tab on the left and select the environment you want to activate. Just selecting the environment should activate it.
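Because the environment name does not always show up in the terminal prompt, it can help to confirm from Python itself which interpreter is running. The snippet below is a minimal sketch, not part of the workshop materials; the function name `describe_active_environment` is made up for illustration, and it relies on the `CONDA_DEFAULT_ENV` variable that conda sets when an environment is activated.

```python
import os
import sys


def describe_active_environment():
    """Return basic facts about the Python interpreter that is currently running."""
    return {
        # Full path of the interpreter; inside a conda environment this
        # normally points into that environment's folder.
        "interpreter": sys.executable,
        # Root folder of the running Python installation or environment.
        "prefix": sys.prefix,
        # conda sets CONDA_DEFAULT_ENV on activation; None if it is not set.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Major.minor version of the running interpreter, e.g. "3.10".
        "python_version": f"{sys.version_info.major}.{sys.version_info.minor}",
    }


info = describe_active_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

If `conda_env` prints `None` or a different name than expected, the script is probably running outside the environment you meant to activate.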

To make sure the packages we need are installed in the environment before we try to import them, we can either check in Anaconda Navigator or use the terminal:

```{bash}
conda list
```

Otherwise, we will get an error message if we try to import packages that are not installed.
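The same check can be done from within Python before any import is attempted. The following is a small sketch under stated assumptions: `find_missing_packages` is a hypothetical helper name, and the package list simply mirrors the example environment file above. It uses `importlib.util.find_spec` from the standard library, which returns `None` for a top-level module that cannot be found.

```python
import importlib.util


def find_missing_packages(module_names):
    """Return the top-level module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# The names below follow the example environment file; adjust them for your project.
required = ["pandas", "numpy"]
missing = find_missing_packages(required)

if missing:
    print("Not installed in this environment:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Note that the import name of a package is not always the same as the name used to install it, so the list should contain import names.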




In many cases, we will want to install additional packages into an existing conda environment. The best way to do this is from the command line:

```{bash}
conda install {package}
```

Make sure your environment is active before installing packages or the packages will not be available in your environment.  


:::{.callout-caution}

You can also install packages with pip:

```{bash}
pip install {package}
```
When working with conda environments, it's best practice to install everything with conda and only use pip for packages that are not available through conda!


:::

## Create New Project

**Setup Workflow:**

Create a GitHub repo -> clone the repo to a local drive (H/C drive) -> open the folder in VS Code -> activate the conda env -> create files/folders

```{text}
├── README.md
├── environment.yaml
├── data-date.txt
├── .gitignore
├── data/
└── scr/
```
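The files and folders in this layout can be created by hand in VS Code, or with a few lines of Python. The sketch below is one possible way to do it, assuming empty placeholder files are acceptable as a starting point; the function name `create_project_skeleton` and the folder name `my-project` are hypothetical, and the demo writes into a temporary directory so that nothing is left behind.

```python
import tempfile
from pathlib import Path


def create_project_skeleton(root):
    """Create the folders and empty placeholder files from the project layout."""
    root = Path(root)
    for folder in ("data", "scr"):
        # parents=True also creates the project root if it does not exist yet.
        (root / folder).mkdir(parents=True, exist_ok=True)
    for filename in ("README.md", "environment.yaml", "data-date.txt", ".gitignore"):
        # touch() creates the file if it is missing and leaves existing content alone.
        (root / filename).touch(exist_ok=True)
    return sorted(path.name for path in root.iterdir())


# Demo in a throwaway directory; "my-project" is a made-up name for illustration.
with tempfile.TemporaryDirectory() as tmp:
    created = create_project_skeleton(Path(tmp) / "my-project")
    print(created)
```

For a real project, pass the path of the cloned repository instead of the temporary directory.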

## Next

Python language basics:

- `NumPy`: data structure packages