Overarching information

- 09:30 - 09:45 Introduction of the three-day course
- **09:45 - 10:00 Lecture: Introduction to python**
- 10:00 - 10:30 Lecture: Environment for working with a python project
- 10:30 - 10:45 Break
- 10:45 - 12:00 Lecture: Basic python syntax
- 12:00 - 13:00 Lunch break
- 13:00 - 15:00 Lecture: python scripting
- 15:00 - 15:15 Break
- 15:15 - 17:00 Hands on: Working with Jupyter Notebook

Feel free to raise questions during the day.

Some other notes:

- If you are on native Windows (i.e. not on Windows Subsystem for Linux), you will not be able to run some of the Bash code cells.

# Intro to python

The aim of this section is to provide a high-level overview of the following concepts:

- From a data science perspective (rather than a software engineering one), what is "python"?
  Or rather, what are the main elements of the Python data science ecosystem?
- Why do you need "python" in a data science project?

# What is "Python" in data science?

Over the three days we will cover the following:
1. conda virtual environment
2. Jupyter platform
3. The python programming language
4. Python packages

## Conda

conda is a package and virtual environment manager.
conda is often associated with python data science projects but is not exclusive to python; for example, you could use conda to manage R libraries as well.

conda can be installed from:
- The [anaconda distribution](https://docs.anaconda.com/anaconda/), which comes with pre-installed data science libraries
- The [miniconda distribution](https://docs.anaconda.com/miniconda/), which comes with only the minimum necessary for using conda

We will come back to conda in Lecture 2.

Note: there is also the [miniforge](https://github.com/conda-forge/miniforge) distribution, a minimal installer maintained by the conda-forge community that uses the conda-forge channel by default.
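
To make the idea of a virtual environment concrete, the sketch below asks a running python which interpreter and environment it belongs to. It relies only on the standard library; `CONDA_DEFAULT_ENV` is the environment variable that conda sets when an environment is activated, so it is expected to be missing outside conda. The variable names are chosen for illustration.

```python
import os
import sys

# Path of the interpreter that is running this code
interpreter_path = sys.executable
# Root folder of the environment this interpreter belongs to
environment_prefix = sys.prefix
# Name of the active conda environment, or a placeholder when conda is not in use
conda_env = os.environ.get("CONDA_DEFAULT_ENV", "(no conda environment active)")

print("interpreter:", interpreter_path)
print("environment prefix:", environment_prefix)
print("conda environment:", conda_env)
```

Running the same code from two different environments should print two different prefixes, which is a quick way to check which environment a notebook or script is actually using.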

## Jupyter platform

The jupyter notebook / jupyter lab software is the main platform we use to interact with the code in the course.

Like conda, jupyter is often associated with python data science, but you could also use jupyter for R (e.g. via the [irkernel](https://github.com/IRkernel/IRkernel)).

We will come back to jupyter notebooks in Lecture 2.

## The python programming language

`python` is a dynamic (dynamically typed) language, i.e. the code is interpreted by the python runtime rather than compiled ahead of time.
- Examples of static languages: `c`, `c++`, `java`, `go`, etc.
- Examples of dynamic languages: `python`, `R`, `javascript`, `sh` / `bash`, etc.

The data types of variables in dynamic languages are determined at runtime and do not need to be declared.
Compared to static languages, dynamic languages are therefore more versatile and easier to write, but are usually much slower to run.
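
A minimal sketch of what "determined at runtime" means in practice is shown below. The names `value` and `add` are made up for illustration.

```python
# A variable is just a name; the type belongs to the value it currently refers to
value = 42
first_type = type(value).__name__

value = "forty-two"  # rebinding the same name to a string is allowed
second_type = type(value).__name__

print(first_type, second_type)


def add(a, b):
    # No parameter types are declared: this works for any values that support "+"
    return a + b


print(add(1, 2))
print(add("data ", "science"))

# A type mismatch is only detected when the offending line actually runs
try:
    add(1, "2")
except TypeError as err:
    print("TypeError:", err)
```

The flexibility of `add` is what makes the language quick to write, and the late `TypeError` is the price: mistakes that a static language would reject before running surface here only during execution.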

python code can be run in the following ways:
- interactively via an interactive shell, e.g. the `python` shell or the `ipython` / `jupyter` shell and notebooks.
- via batch processing in a script, e.g. `python hello_world.py`

Note: depending on your operating system and virtual environment, the command may be available as `python3` instead of `python`.
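
As an illustration of the script route, here is one possible content for `hello_world.py`. The `greeting` function and the optional command-line argument are assumptions added for the example, not part of the course material.

```python
# hello_world.py
import sys


def greeting(name: str = "world") -> str:
    """Build the message; kept separate so it can also be imported and reused."""
    return f"Hello, {name}!"


if __name__ == "__main__":
    # sys.argv[0] is the script name; any further items are command-line arguments
    name = sys.argv[1] if len(sys.argv) > 1 else "world"
    print(greeting(name))
```

Saved to a file, this can be run with `python hello_world.py` or `python hello_world.py Ada`. The `if __name__ == "__main__":` guard means the printing only happens when the file is run as a script, not when it is imported from another module.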

## Python packages

There are a LOT of packages for various purposes in python, e.g. `pandas`, `numpy`, `sklearn`, `requests`, etc.

These packages are hosted on the Python Package Index (PyPI) at https://pypi.org/.
They can be installed in two ways:
- `pip`: the native package installer for python
- `conda`: installs from conda channels rather than directly from PyPI
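
Once packages are installed, python itself can report which versions are present in the current environment. The sketch below uses `importlib.metadata` from the standard library (Python 3.8 or later); the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string for each package, or None when it is not installed
for name in ["pandas", "numpy", "requests"]:
    print(name, installed_version(name))
```

Note that the lookup uses the name the package is installed under, which can differ from the import name: `sklearn` is installed as `scikit-learn`.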

# Why do we need python?

As researchers working on a (health) data science project, why do we need python when we have other tools (e.g. R)?

One obvious reason is that you might need a specific library that is python-based.

For example, if you need to use Large Language Models, it is usually easiest to use python and the Hugging Face libraries,
rather than trying to use an interface library in R.

python is also used in some of the most important data science tools.

The `conda` environment and the `jupyter` platform are not unique to python and I strongly suggest you use either or both in your non-python projects.

There are other tools such as `snakemake`, `dagster`, etc. that are used through python.

When your project becomes big, using python for some of the tasks can make it more scalable and reproducible.

In general the python ecosystem strikes a good balance between:
- Quick implementation of your code due to the dynamic nature of the language and rich availability of packages
- Various mechanisms to maintain the scalability and reproducibility of the codebase:
  - unit testing (`pytest`)
  - code formatting (`black`) and linting (`flake8`)
  - type annotation and type checking (`mypy`)
  - data validation (`pydantic`)
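
To give a flavour of two of these mechanisms, the sketch below combines type annotations (which a checker such as `mypy` can verify) with test functions written in the style that `pytest` collects. The `body_mass_index` helper is a hypothetical example and the code itself needs only the standard library.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Hypothetical helper: BMI is weight divided by height squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m**2


def test_body_mass_index() -> None:
    # pytest collects functions named test_* and reports any failed assertion
    assert round(body_mass_index(70.0, 1.75), 1) == 22.9


def test_rejects_non_positive_height() -> None:
    # The function should refuse inputs that make no physical sense
    try:
        body_mass_index(70.0, 0.0)
    except ValueError:
        return
    raise AssertionError("expected a ValueError for a height of zero")
```

Saved in a file whose name starts with `test_`, these functions would be picked up by running `pytest` in the same folder. The annotations do not change how the code runs; they document intent and let a type checker flag calls such as `body_mass_index("70", 1.75)` before execution.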

You should pick what works best for your needs, e.g. use R for statistical analysis and producing visualisations, use python for data processing, use databases for managing data, and use shell scripts for some other use cases.

----

## Reference